Brecht Coghe created BEAM-6723:
----------------------------------

             Summary: Reshuffle in streaming pipeline does not work on 
dataflowrunner python
                 Key: BEAM-6723
                 URL: https://issues.apache.org/jira/browse/BEAM-6723
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow, sdk-py-core
            Reporter: Brecht Coghe


When using a Reshuffle after windowing, the dataflowrunner gives the following 
error:

"org.apache.beam.sdk.transforms.windowing.IntervalWindow cannot be cast to 
org.apache.beam.sdk.transforms.windowing.GlobalWindow"

This makes it impossible to prevent fusion of operations and to distribute our 
workload nicely.

 

For more context, see following post where [~robertwb] linked this Jira board:

[https://stackoverflow.com/questions/54764081/google-dataflow-streaming-pipeline-is-not-distributing-workload-over-several-wor]

Quick summary: Our use case is actually video analytics. We want to use the 
windowing to get small intervals of videos and the grouping to group per video 
stream. The group by key is thus a game id. What comes out the window and 
grouping is several windows within one game so with 1 game_id key and this does 
not distribute over several workers. The proposed workaround would be to add 
the window_id to the key after windowing, then reshuffle and then the 
processing would be able to run in parallel per window in one videostream. Can 
someone confirm this approach (if reshuffle would work).

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to