Jérémie Bigras-Dunberry created BEAM-12701:
----------------------------------------------

             Summary: Converting two dataframes with to_csv in the same pipeline 
causes PCollection label collision
                 Key: BEAM-12701
                 URL: https://issues.apache.org/jira/browse/BEAM-12701
             Project: Beam
          Issue Type: Bug
          Components: io-py-common
    Affects Versions: 2.31.0
            Reporter: Jérémie Bigras-Dunberry


 

If you call to_csv on two DeferredDataFrames in a single pipeline, like this:
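{code:python}
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

df1 = pd.DataFrame.from_records({"a":"b"}, index=[0])
df2 = pd.DataFrame.from_records({"a":"b"}, index=[0])

with beam.Pipeline() as p:
    # Wrap each pandas DataFrame as a deferred dataframe with its own label.
    df1 = to_dataframe(to_pcollection(df1, pipeline=p), label="df1")
    df2 = to_dataframe(to_pcollection(df2, pipeline=p), label="df2")

    df1.to_csv("test.csv")
    # This second call raises the RuntimeError.
    df2.to_csv("test2.csv")
{code}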
The second to_csv call fails with this error:

 
{code:java}
RuntimeError: A transform with label "ToPCollection(df)" already exists in the 
pipeline. To apply a transform with a specified label write pvalue | "label" >> 
transform

{code}

I think this happens because to_csv calls to_pcollection without a label, 
so the same label is inferred for both to_csv calls.
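
To illustrate the suspected mechanism, here is a toy model in plain Python. 
It is not Beam's code: ToyPipeline, toy_to_pcollection and toy_to_csv are 
hypothetical stand-ins, and the path-based label in the second half is only 
one possible way to keep the labels unique.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline that requires unique labels."""

    def __init__(self):
        self.labels = set()

    def apply(self, label):
        # Reject a label that has already been applied to this pipeline.
        if label in self.labels:
            raise RuntimeError(
                f'A transform with label "{label}" already exists in the pipeline.'
            )
        self.labels.add(label)


def toy_to_pcollection(pipeline, var_name, label=None):
    # Assumption: with no explicit label, one is derived from the variable name.
    if label is None:
        label = f"ToPCollection({var_name})"
    pipeline.apply(label)
    return label


def toy_to_csv(pipeline, path, label=None):
    # Inside this helper the dataframe is always named "df", so every call
    # that passes no label ends up with the same derived label.
    return toy_to_pcollection(pipeline, "df", label=label)


def run_demo():
    # Two unlabeled calls on one pipeline: the second one collides.
    colliding = ToyPipeline()
    toy_to_csv(colliding, "test.csv")
    try:
        toy_to_csv(colliding, "test2.csv")
        error = None
    except RuntimeError as exc:
        error = str(exc)

    # Including the output path in the label keeps the two calls distinct.
    unique = ToyPipeline()
    for path in ("test.csv", "test2.csv"):
        toy_to_csv(unique, path, label=f"ToPCollection(df) - {path}")
    return error, sorted(unique.labels)


error, labels = run_demo()
```

In this model the collision goes away as soon as each call gets a distinct 
label, which suggests to_csv would need either a caller-supplied label or a 
label derived from something unique per call.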

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
