[ 
https://issues.apache.org/jira/browse/BEAM-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Winkelman reassigned BEAM-6735:
------------------------------------

    Fix Version/s: 2.12.0
         Assignee: Kyle Winkelman

> WriteFiles with runner-determined sharding is forced to handle spilling
> -----------------------------------------------------------------------
>
>                 Key: BEAM-6735
>                 URL: https://issues.apache.org/jira/browse/BEAM-6735
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Kyle Winkelman
>            Assignee: Kyle Winkelman
>            Priority: P3
>              Labels: Clarified
>             Fix For: 2.12.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> As a result of BEAM-2302, files in excess of WriteFiles 
> maxNumWritersPerBundle are shuffled to be written later. The downside to this 
> is that even if you can guarantee that maxNumWritersPerBundle is high enough 
> to handle all writes you still have to pay the overhead of this write now 
> being a MultiOutput ParDo.
> e.g. In the Spark Runner when a ParDo has multiple outputs the returned data 
> is cached and if using the disableCache pipeline option it would cause 
> recalculation and all the temp files would be written again.
> I'm sure that the Spark Runner is not the only runner that would benefit from 
> an optional setting for WriteFiles that would skip this spilling and simplify 
> the pipeline.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to