[
https://issues.apache.org/jira/browse/BEAM-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Backx updated BEAM-8191:
------------------------------
Resolution: Won't Do
Status: Resolved (was: Open)
See my previous comment.
> Multiple Flatten.pCollections() transforms generate an overwhelming number of
> tasks
> -----------------------------------------------------------------------------------
>
> Key: BEAM-8191
> URL: https://issues.apache.org/jira/browse/BEAM-8191
> Project: Beam
> Issue Type: Bug
> Components: runner-spark
> Affects Versions: 2.12.0, 2.14.0, 2.15.0
> Reporter: Peter Backx
> Priority: P3
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> The Flatten.pCollections() transform is translated into a Spark union
> operation. The resulting RDD has as many partitions as all of the
> originating RDDs combined. If you flatten 2 PCollections with 10 partitions
> each, the result has 20 partitions.
> This is fine in small pipelines, but in our main pipeline the number
> of tasks quickly gets out of hand (over 500k tasks in one stage). This
> overloads the driver and crashes the process.
> I have created a small repro case:
> [https://github.com/pbackx/beam-flatmap-test]
>
> A possible solution is to add a coalesce call after the union. We have been
> testing this and it seems to do exactly what we want, but I'm not sure whether
> this fix is applicable in all cases.
> I will open a PR for this so that you can review my proposed change and
> discuss whether or not it's a good idea.
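
The partition arithmetic described in the quoted issue can be sketched in plain Java. This is a toy model only, not Beam or Spark runner code: the class and method names are hypothetical, the three-round doubling loop is an invented pipeline shape used purely for illustration, and the only assumption about Spark is the documented behavior that a union yields the sum of its inputs' partitions and that a coalesce without shuffle can lower the partition count but not raise it.

```java
import java.util.List;

// Toy model of RDD partition counts; no Spark dependency, names are hypothetical.
class FlattenPartitionModel {

    // Models a union: the result has the sum of the input partition counts.
    static long union(List<Long> partitionCounts) {
        long total = 0;
        for (long count : partitionCounts) {
            total += count;
        }
        return total;
    }

    // Models a coalesce without shuffle: it can only reduce the partition count.
    static long coalesce(long partitions, long target) {
        return Math.min(partitions, target);
    }

    public static void main(String[] args) {
        // The example from the issue: 2 inputs with 10 partitions each.
        long flattened = union(List.of(10L, 10L));
        System.out.println("after one flatten: " + flattened); // 20

        // Hypothetical pipeline shape: each round flattens two branches
        // derived from the previous result, so the count doubles per round.
        long partitions = 10;
        for (int round = 0; round < 3; round++) {
            partitions = union(List.of(partitions, partitions));
        }
        System.out.println("after three rounds: " + partitions); // 80

        // Capping the count after the union keeps the number of tasks bounded.
        System.out.println("after coalesce: " + coalesce(partitions, 10)); // 10
    }
}
```

The model shows why the count grows with every flatten and why a cap after the union bounds it. It says nothing about whether coalescing is safe for every pipeline, which is the open question raised in the issue.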