[ 
https://issues.apache.org/jira/browse/BEAM-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reassigned BEAM-8191:
----------------------------------

    Assignee: Peter Backx

> Multiple Flatten.pCollections() transforms generate an overwhelming number of 
> tasks
> -----------------------------------------------------------------------------------
>
>                 Key: BEAM-8191
>                 URL: https://issues.apache.org/jira/browse/BEAM-8191
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-spark
>    Affects Versions: 2.12.0, 2.14.0, 2.15.0
>            Reporter: Peter Backx
>            Assignee: Peter Backx
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Flatten.pCollections() is translated into a Spark union operation. The 
> resulting RDD will have the sum of the partitions in the originating RDDs.
> If you flatten 2 PCollections with each 10 partitions, the result will have 
> 20 partitions.
> This is ok in small pipelins, but in our main pipeline, this means the number 
> of tasks grows out of hand quite easily (over 500k tasks in one stage). This 
> overloads the driver and crashes the process.
> I have created a small repro case:
> [https://github.com/pbackx/beam-flatmap-test]
>  
> A possible solution is to add a coalesce call after the union. We have been 
> testing this and it seems to do exactly what we want, but I'm not sure if 
> this fix is applicable for all cases. 
> I will open a PR for this so that you can review my proposed change and 
> discuss whether or not it's a good idea.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to