[
https://issues.apache.org/jira/browse/BEAM-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Halperin resolved BEAM-1848.
-----------------------------------
Resolution: Fixed
Fix Version/s: Not applicable
> GroupByKey stuck with more than one worker on Dataflow
> ------------------------------------------------------
>
> Key: BEAM-1848
> URL: https://issues.apache.org/jira/browse/BEAM-1848
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow, sdk-java-core
> Affects Versions: 0.6.0
> Reporter: peay
> Assignee: Kenneth Knowles
> Priority: Blocker
> Fix For: Not applicable
>
>
> I have a simple pipeline which has a sliding window, a {{GroupByKey}} and
> then a simple stateful {{DoFn}}. I run in batch mode ({{--streaming=false}})
> on Dataflow.
> On a very small dataset of a couple KBs, I can run the pipeline to
> completion. Dataflow does show "successful". On a larger dataset (but still
> very small, 100s MB read by source), the pipeline stays stuck, no matter how
> long I wait. In addition, it never really gets stuck at the same point. I
> expect about 340k output records, and never get more than 70k before it gets
> stuck.
> Dataflow always autoscales from 1 to 8 workers, which is my limit.
> Run A: after 30mins+: no elements added out of {{GroupByKey}}, but logs have
> repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;aaaaaa;bbbbbb at
> {"position":{"shufflePosition":"AAAAAAD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range
> [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE),
> ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at
> ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ): unstarted
> Refused to split GroupingShuffleReader <unstarted in shuffle range
> [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE),
> ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at
> ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ)
> {code}
> Run B: after a couple minutes, elements get added to output of
> {{GroupByKey}}, up to 56,128 and then stays stuck doing nothing, but logs
> have repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;ccccccc;dddddd at
> {"position":{"shufflePosition":"AAAAAQD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range
> [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE),
> ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at
> ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ): unstarted
> Refused to split GroupingShuffleReader <unstarted in shuffle range
> [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE),
> ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at
> ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ)
> {code}
> Run C: after 10mins: elements get added to output of {{GroupByKey}}, up to
> 70,262 and then stays stuck doing nothing. No logs as above as far as I can
> find.
> I've run this about a dozen times and it always gets stuck. I am trying out
> right now to run the pipeline with the worker limit set to one, and
> {{GroupByKey}} has output 150k so far, still increasing. This seems like a
> workaround, but using one worker only is not ideal.
> cc [[email protected]]
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)