Jérémie Bigras-Dunberry created BEAM-12701:
----------------------------------------------
Summary: Converting two dataframe to_csv in the same pipeline
causes PCollection label collision
Key: BEAM-12701
URL: https://issues.apache.org/jira/browse/BEAM-12701
Project: Beam
Issue Type: Bug
Components: io-py-common
Affects Versions: 2.31.0
Reporter: Jérémie Bigras-Dunberry
If you call to_csv on a DeferredDataFrame twice in a single pipeline, like
this:
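{code:python}
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

df1 = pd.DataFrame.from_records({"a":"b"}, index=[0])
df2 = pd.DataFrame.from_records({"a":"b"}, index=[0])
with beam.Pipeline() as p:
    # Wrap each pandas DataFrame as a deferred Beam DataFrame with a distinct label.
    df1 = to_dataframe(to_pcollection(df1, pipeline=p), label="df1")
    df2 = to_dataframe(to_pcollection(df2, pipeline=p), label="df2")
    df1.to_csv("test.csv")
    df2.to_csv("test2.csv")  # raises the RuntimeError shown below
{code}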
You get this error on the second to_csv call:
{code:java}
RuntimeError: A transform with label "ToPCollection(df)" already exists in the
pipeline. To apply a transform with a specified label write pvalue | "label" >>
transform
{code}
I think this happens because to_csv calls to_pcollection without passing a
label, so the same label is inferred for both to_csv calls.
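
To illustrate the suspected cause, here is a minimal toy model in plain Python. It is not Beam code: ToyPipeline, to_pcollection_sketch, to_csv_unlabeled and to_csv_labeled are hypothetical names, and the fixed default label is an assumption that mirrors the error message above. It only shows why an inferred label that is the same on every call collides, and how deriving the label from something unique per call (here, the output path) could avoid that.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that requires unique transform labels."""

    def __init__(self):
        self.applied_labels = set()

    def apply(self, label):
        # Reject a label that was already used, as the reported error does.
        if label in self.applied_labels:
            raise RuntimeError(
                'A transform with label "%s" already exists in the pipeline.' % label
            )
        self.applied_labels.add(label)
        return label


def to_pcollection_sketch(pipeline, label=None):
    # Assumption: with no explicit label, the same default is inferred every time.
    if label is None:
        label = "ToPCollection(df)"
    return pipeline.apply(label)


def to_csv_unlabeled(pipeline, path):
    # Models the suspected behavior: no label is passed down (path is unused here).
    return to_pcollection_sketch(pipeline)


def to_csv_labeled(pipeline, path):
    # Possible remedy: derive a label that differs per call, here from the path.
    return to_pcollection_sketch(pipeline, label="ToPCollection(%s)" % path)


# Two unlabeled calls on the same pipeline collide on the second call.
p = ToyPipeline()
to_csv_unlabeled(p, "test.csv")
try:
    to_csv_unlabeled(p, "test2.csv")
    collided = False
    message = ""
except RuntimeError as err:
    collided = True
    message = str(err)

# Two labeled calls on a fresh pipeline both succeed.
q = ToyPipeline()
first = to_csv_labeled(q, "test.csv")
second = to_csv_labeled(q, "test2.csv")
```

In this sketch the path is used only because it differs between the two calls; any value that is unique per call would serve the same purpose.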