I don't know how to proceed. I am convinced that in batch mode my proposal is the correct way to proceed. Another example of a silly interaction that occurs due to using defaultParallelism in SourceRDD is reading two different files. If one of the two files is a couple of orders of magnitude larger, you have to allocate enough resources to the job to read the larger file, say n cores; the smaller file then also gets split into n pieces, which breaks it up into many very small bundles.
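To make that imbalance concrete, here is a rough sketch of the arithmetic only (not Beam code); the file sizes and core count are made-up numbers for illustration:

```java
public class BundleSplitSketch {
  public static void main(String[] args) {
    long largeFileBytes = 100L * 1024 * 1024 * 1024;  // ~100 GB (made-up size)
    long smallFileBytes = 1L * 1024 * 1024 * 1024;    // ~1 GB (made-up size)
    int defaultParallelism = 200;                     // cores sized for the large file

    // Splitting by defaultParallelism: both files get the same number of splits,
    // so the large file gets reasonable ~512 MB splits while the small file is
    // shredded into 200 splits of roughly 5 MB each.
    System.out.printf("large-file split: %d MB, small-file split: %d MB%n",
        largeFileBytes / defaultParallelism / (1024 * 1024),
        smallFileBytes / defaultParallelism / (1024 * 1024));

    // Splitting by a fixed bundle size (e.g. 64 MB): the bundle count scales
    // with the file size instead, so the small file yields ~16 sane bundles.
    long bundleSizeBytes = 64L * 1024 * 1024;
    System.out.printf("small-file bundles (fixed 64 MB bundles): %d%n",
        (smallFileBytes + bundleSizeBytes - 1) / bundleSizeBytes);
  }
}
```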
The issue is that I do not understand the repercussions this change will have on streaming mode. Maybe we will need two different approaches to the groupBy logic, one for each mode. I am OK with this being experimental and only taking effect if you supply --bundleSize in the pipeline options. I would like an answer to the last question I asked, to understand whether in batch mode I can always use the new experimental groupByKeyOnlyDefaultPartitioner, because I believe it will not cause a double shuffle in batch mode. Other than that, I need a code review and agreement on the approach. If the approach is not agreed upon, I would appreciate advice on how to get the SparkRunner to work with dynamicAllocation (starting with 2 cores and spinning up more if the files are large and get split into more bundles).
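A minimal sketch of how a pipeline might opt in, assuming the proposed bundleSize option ends up exposed on SparkPipelineOptions; the setter name and type here are a guess based on this discussion, not a committed API:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BundleSizeOptionSketch {
  public static void main(String[] args) {
    // Assumption: the proposed bundleSize option is surfaced on
    // SparkPipelineOptions; the exact name/type may differ once the PR lands.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);

    // Opt in to fixed-size bundles (e.g. 64 MB) instead of splitting sources
    // by defaultParallelism in batch mode.
    options.setBundleSize(64L * 1024 * 1024);

    // ... construct and run the pipeline with these options ...
  }
}
```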
