Brecht Coghe created BEAM-6723:
----------------------------------
Summary: Reshuffle in streaming pipeline does not work on
dataflowrunner python
Key: BEAM-6723
URL: https://issues.apache.org/jira/browse/BEAM-6723
Project: Beam
Issue Type: Bug
Components: runner-dataflow, sdk-py-core
Reporter: Brecht Coghe
When using a Reshuffle after windowing, the dataflowrunner gives the following
error:
"org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast to
org.apache.beam.sdk.transforms.windowing.GlobalWindow"
This makes it impossible to prevent fusion of operations and to distribute our
workload nicely.
For more context, see following post where [~robertwb] linked this Jira board:
[https://stackoverflow.com/questions/54764081/google-dataflow-streaming-pipeline-is-not-distributing-workload-over-several-wor]
Quick summary: Our use case is actually video analytics. We want to use the
windowing to get small intervals of videos and the grouping to group per video
stream. The group by key is thus a game id. What comes out the window and
grouping is several windows within one game so with 1 game_id key and this does
not distribute over several workers. The proposed workaround would be to add
the window_id to the key after windowing, then reshuffle and then the
processing would be able to run in parallel per window in one videostream. Can
someone confirm this approach (if reshuffle would work).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)