[ 
https://issues.apache.org/jira/browse/BEAM-12593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415798#comment-17415798
 ] 

Brian Hulette edited comment on BEAM-12593 at 9/15/21, 11:36 PM:
-----------------------------------------------------------------

I looked into BEAM-12764. The error is occurring because Dataflow workers are 
failing to unpickle DoFns created by the DataFrame API. The DoFns include 
serialized pandas dataframes, which are created with pandas 1.3.x after this 
change, but Dataflow workers are still on pandas 1.2.x. My proposed solution:
- go ahead and upgrade the Dataflow worker to use pandas 1.3.x
- re-apply pr/15165 with a patch that bumps the worker container

Note that it looks like pandas tries to maintain backwards compatibility with 
pickled dataframes [back to 
v0.20.3|https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html]. 
So having a newer version on Dataflow workers shouldn't be an issue for 
serialization. (Having a mismatched pandas version could still lead to 
undefined behavior in DataFrame API operations.)
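
To make the direction of that compatibility concrete, here is a minimal sketch 
of the rule as I understand it: a worker can be expected to load a pickled 
dataframe only if its pandas is at least as new as the one that wrote it. The 
helper names (major_minor, reader_can_load) are hypothetical and not part of 
Beam or pandas, and comparing only major.minor is an assumption for 
illustration. In practice the version strings would come from 
pandas.__version__ on each side.

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.3.0' or '1.3.0rc1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def reader_can_load(writer_version: str, reader_version: str) -> bool:
    """Assumed rule: pickles are backwards compatible, not forwards compatible.

    A reader on the same or a newer major.minor than the writer is expected to
    load the pickle; an older reader may fail.
    """
    return major_minor(reader_version) >= major_minor(writer_version)


# Pipeline built with pandas 1.3.x, worker still on 1.2.x: the failing case.
print(reader_can_load("1.3.0", "1.2.5"))  # False
# Worker upgraded past the submission environment: expected to be fine.
print(reader_can_load("1.2.5", "1.3.0"))  # True
```

This only covers unpickling. It says nothing about whether DataFrame API 
operations behave the same across the two versions.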

Open question: Why are we creating DoFns with serialized dataframes?


was (Author: bhulette):
I looked into BEAM-12764. The error is occurring because Dataflow workers are 
failing to unpickle DoFns created by the DataFrame API. The DoFns include 
serialized pandas dataframes, which are created with pandas 1.3.x after this 
change, but Dataflow workers are still on pandas 1.2.x. My proposed solution:
- go ahead and upgrade the Dataflow worker to use pandas 1.3.x
- re-apply pr/15165 with a patch that bumps the worker container

Note that pandas tries to maintain backwards compatibility with pickled 
dataframes. So having a newer version on Dataflow workers shouldn't be an issue 
for serialization. (Having a mismatched pandas version could still lead to 
undefined behavior in DataFrame API operations.)

Open question: Why are we creating DoFns with serialized dataframes?

> DataFrame API: Support pandas 1.3.x
> -----------------------------------
>
>                 Key: BEAM-12593
>                 URL: https://issues.apache.org/jira/browse/BEAM-12593
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe
>            Reporter: Brian Hulette
>            Assignee: Brian Hulette
>            Priority: P2
>          Time Spent: 13h
>  Remaining Estimate: 0h
>
> Started a WIP PR here: https://github.com/apache/beam/pull/15008 that used 
> rc1. Now the official 1.3.0 is out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
