I don't know how to proceed.

I am convinced that in batch mode my proposal is the correct way to proceed. 
Another example of a silly interaction that occurs due to using 
defaultParallelism in SourceRDD is reading 2 different files. If one of the two 
files is a couple of orders of magnitude larger, you need to allocate enough 
resources to the job to read the larger file, let's say n cores; the smaller 
file will then also be split into n pieces, which breaks it up into many very 
small bundles.
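To make that concrete, here is a rough sketch of the two sizing strategies. It 
is illustrative only: the helper, its parameters, and the fallback arithmetic 
are my assumptions, not the actual SourceRDD code.

    import java.util.List;
    import org.apache.beam.sdk.io.BoundedSource;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Illustrative helper only -- not the actual SourceRDD code.
    // "bundleSizeBytes" stands in for the proposed --bundleSize option.
    class SplitSketch {
      static <T> List<? extends BoundedSource<T>> splitSource(
          BoundedSource<T> source,
          PipelineOptions options,
          int defaultParallelism,
          long bundleSizeBytes) throws Exception {
        long estimatedSize = source.getEstimatedSizeBytes(options);
        // Current behaviour: the desired bundle size is derived from
        // defaultParallelism, so the small file is still cut into n pieces
        // even though the n cores were sized for the large file.
        // Proposed behaviour: honour an explicit bundle size when given.
        long desiredBundleSize = bundleSizeBytes > 0
            ? bundleSizeBytes
            : Math.max(1, estimatedSize / defaultParallelism);
        return source.split(desiredBundleSize, options);
      }
    }

With the defaultParallelism-based sizing, a 1 MB file read next to a 10 GB file 
on, say, 200 cores ends up as roughly 200 bundles of a few kilobytes each; with 
an explicit bundle size it would stay as one (or a few) bundles.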

The issue is that I do not understand the repercussions this change will have 
on streaming mode. Maybe we will need two different approaches to the groupBy 
logic, one for each mode.

I am OK with this being experimental and only taking effect if you supply 
--bundleSize in the pipeline options. I would also like an answer to the last 
question I asked, to understand whether in batch mode I can always use the new 
experimental groupByKeyOnlyDefaultPartitioner, because I believe it will not 
cause a double shuffle in batch mode.
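For what it's worth, a minimal sketch of how such an experimental option could 
be surfaced (the interface name and the default value here are my assumptions, 
not necessarily what the PR does):

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Sketch only: one way an experimental bundle-size option could look.
    public interface BundleSizeOptions extends PipelineOptions {
      @Description(
          "Desired bundle size in bytes for batch reads; 0 keeps the current "
              + "defaultParallelism-based splitting.")
      @Default.Long(0)
      Long getBundleSize();

      void setBundleSize(Long bundleSize);
    }

Callers would then pass something like --bundleSize=67108864 (64 MB), and 
leaving it unset would keep today's behaviour.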

Other than that, I believe I need a code review and confirmation that everyone 
agrees with the approach.

If this is not agreed upon, I would hope someone could give me some advice on 
how to get the SparkRunner to work with dynamicAllocation (starting with 2 
cores and spinning up more if the files are large and get split into more 
bundles).
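In case it helps, this is the kind of Spark configuration I understand dynamic 
allocation to need (a sketch; the executor counts are placeholders, and whether 
the SparkRunner then produces enough bundles to actually trigger scale-up is 
exactly the open question):

    import org.apache.spark.SparkConf;

    // Sketch of the dynamic-allocation settings that would matter here;
    // the executor counts are just placeholders.
    public class DynamicAllocationConfSketch {
      public static SparkConf conf() {
        return new SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true") // external shuffle service required
            .set("spark.dynamicAllocation.initialExecutors", "1")
            .set("spark.dynamicAllocation.minExecutors", "1")
            .set("spark.dynamicAllocation.maxExecutors", "20")
            .set("spark.executor.cores", "2");
      }
    }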

[ Full content available at: https://github.com/apache/beam/pull/6181 ]