[ https://issues.apache.org/jira/browse/BEAM-12593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415798#comment-17415798 ]
Brian Hulette edited comment on BEAM-12593 at 9/15/21, 11:36 PM:
-----------------------------------------------------------------

I looked into BEAM-12764. The error occurs because Dataflow workers fail to unpickle DoFns created by the DataFrame API. The DoFns include serialized pandas dataframes, which are created with pandas 1.3.x after this change, while Dataflow workers are still on pandas 1.2.x.

My proposed solution:
- go ahead and upgrade the Dataflow worker to use pandas 1.3.x
- re-apply pr/15165 with a patch that bumps the worker container

Note that it looks like pandas tries to maintain backwards compatibility with pickled dataframes [back to v0.20.3|https://pandas.pydata.org/docs/reference/api/pandas.read_pickle.html]. So having a newer version on Dataflow workers shouldn't be an issue for serialization, although a mismatched pandas version could still lead to undefined behavior in DataFrame API operations.

Open question: Why are we creating DoFns with serialized dataframes?

was (Author: bhulette):
I looked into BEAM-12764. The error occurs because Dataflow workers fail to unpickle DoFns created by the DataFrame API. The DoFns include serialized pandas dataframes, which are created with pandas 1.3.x after this change, while Dataflow workers are still on pandas 1.2.x.

My proposed solution:
- go ahead and upgrade the Dataflow worker to use pandas 1.3.x
- re-apply pr/15165 with a patch that bumps the worker container

Note that pandas tries to maintain backwards compatibility with pickled dataframes. So having a newer version on Dataflow workers shouldn't be an issue for serialization, although a mismatched pandas version could still lead to undefined behavior in DataFrame API operations.

Open question: Why are we creating DoFns with serialized dataframes?
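The version constraint discussed in the comment (a pickle written by pandas 1.3.x failing to load on a 1.2.x worker, while a newer reader is expected to handle older pickles) can be sketched as a simple check. This is only an illustrative heuristic, not something Beam or pandas provides: the helper names `parse_major_minor` and `reader_can_unpickle` are hypothetical, and the rule "the reader must be at least as new as the writer" is an assumption drawn from the backwards-compatibility note, not a documented guarantee.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.3.0' or '1.3.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def reader_can_unpickle(writer_version: str, reader_version: str) -> bool:
    """Heuristic (an assumption, not a pandas guarantee): a reader is expected
    to load pickles written by the same or an older major.minor release,
    but not by a newer one."""
    return parse_major_minor(reader_version) >= parse_major_minor(writer_version)


# Submission environment on 1.3.x, worker still on 1.2.x: the failing case.
print(reader_can_unpickle(writer_version="1.3.0", reader_version="1.2.5"))
# Worker upgraded to 1.3.x, pipeline built with 1.2.x: expected to work.
print(reader_can_unpickle(writer_version="1.2.5", reader_version="1.3.0"))
```

In practice the writer version would be `pandas.__version__` in the environment that builds the pipeline, and the reader version would be whatever the worker container installs. A check like this covers only serialization; it says nothing about behavioral differences between pandas releases.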
> DataFrame API: Support pandas 1.3.x
> -----------------------------------
>
> Key: BEAM-12593
> URL: https://issues.apache.org/jira/browse/BEAM-12593
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: P2
> Time Spent: 13h
> Remaining Estimate: 0h
>
> Started a WIP PR here: https://github.com/apache/beam/pull/15008 that used rc1. Now the official 1.3.0 is out.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)