[ https://issues.apache.org/jira/browse/BEAM-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ismaël Mejía updated BEAM-8191: ------------------------------- Status: Open (was: Triage Needed) > Multiple Flatten.pCollections() transforms generate an overwhelming number of > tasks > ----------------------------------------------------------------------------------- > > Key: BEAM-8191 > URL: https://issues.apache.org/jira/browse/BEAM-8191 > Project: Beam > Issue Type: Bug > Components: runner-spark > Affects Versions: 2.12.0, 2.14.0, 2.15.0 > Reporter: Peter Backx > Assignee: Peter Backx > Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > The Flatten.pCollections() is translated into a Spark union operation. The > resulting RDD will have the sum of the partitions in the originating RDDs. > If you flatten 2 PCollections with each 10 partitions, the result will have > 20 partitions. > This is ok in small pipelins, but in our main pipeline, this means the number > of tasks grows out of hand quite easily (over 500k tasks in one stage). This > overloads the driver and crashes the process. > I have created a small repro case: > [https://github.com/pbackx/beam-flatmap-test] > > A possible solution is to add a coalesce call after the union. We have been > testing this and it seems to do exactly what we want, but I'm not sure if > this fix is applicable for all cases. > I will open a PR for this so that you can review my proposed change and > discuss whether or not it's a good idea. -- This message was sent by Atlassian Jira (v8.3.2#803003)