alexeykudinkin commented on PR #7723:
URL: https://github.com/apache/hudi/pull/7723#issuecomment-1399162347

   > 1: I am not sure if ParallelismHelper is really required as a base class. We
can simplify it to a static method that takes in a config property and
numPartitions and returns the deduced value.
   
   Unfortunately, we have to abstract it in a way that would make it work w/ 
Java/Flink/Spark.
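
   To illustrate the point, here is a hedged sketch of such an abstraction (the names and the toy `JavaListParallelismHelper` engine are illustrative, not Hudi's actual API): "how many partitions does the input naturally have" is engine-specific (a Spark RDD, a Flink stream, a plain Java collection), so the deduction can't be a single static method.

   ```java
   import java.util.List;
   import java.util.Optional;

   // Hypothetical engine-agnostic base class: shared fallback logic lives
   // here, while each engine supplies its own partition counting.
   abstract class ParallelismHelper<I> {

       // Engine-specific hook: e.g. a Spark implementation could return
       // rdd.getNumPartitions(); a Flink one, the source's parallelism.
       protected abstract int deduceNumPartitions(I input);

       // Shared logic: an explicitly configured value always wins;
       // otherwise fall back to the input's natural partitioning.
       public int deduceParallelism(I input, Optional<Integer> configured) {
           return configured.orElseGet(() -> deduceNumPartitions(input));
       }
   }

   // Toy "engine" for demonstration: a plain Java list treated as
   // partitioned into fixed-size chunks of 100 elements.
   class JavaListParallelismHelper extends ParallelismHelper<List<?>> {
       private static final int CHUNK_SIZE = 100;

       @Override
       protected int deduceNumPartitions(List<?> input) {
           return Math.max(1, (input.size() + CHUNK_SIZE - 1) / CHUNK_SIZE);
       }
   }
   ```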
   
   > 2: While I agree w/ the intent of the patch itself, we might need to think
through whether this will work for all cases and whether we are confident about
switching this a few days before code freeze. For example, would this play a
part in bloom indexes? 200 has been working w/o any issues, even for larger
batches, for the most part. I will think about the change and post an update if
I can think of scenarios that might be of interest to us.
   
   The only structural change compared to the current state is that we're no
longer going to override parallelism w/ the default value of 200. If the user
specifies the config, it will still take precedence.
   
   I was able to confirm in multiple benchmarks that avoiding setting
parallelism to an arbitrary value (200) brings considerable performance benefits:
   
   1. In the case of bulk-insert: we will follow the natural partitioning of
the dataset (i.e. we will have as many partitions as there are Parquet
row-groups).
   2. In the case of upsert/insert: we might fall back to
`spark.default.parallelism`, which is deduced dynamically based on the number of
cores available to the cluster, which also seems superior to the existing state.
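
   The fallback chain above can be sketched as follows (a minimal, hypothetical helper, not Hudi's actual code): the user-specified config wins; otherwise the dataset's natural partitioning (e.g. one partition per Parquet row-group); otherwise the engine default such as `spark.default.parallelism`.

   ```java
   import java.util.OptionalInt;

   final class ParallelismDeduction {
       // Hypothetical deduction mirroring the precedence described above:
       //   1. explicit user config always takes precedence
       //   2. otherwise follow the dataset's natural partitioning
       //   3. otherwise fall back to the engine default
       //      (e.g. spark.default.parallelism, derived from cluster cores)
       static int deduce(OptionalInt configured,
                         OptionalInt naturalPartitions,
                         int engineDefault) {
           if (configured.isPresent()) {
               return configured.getAsInt();
           }
           return naturalPartitions.orElse(engineDefault);
       }
   }
   ```

   Note that nothing here ever injects a fixed 200; the arbitrary constant disappears from the chain entirely, which is what the benchmarks above exercise.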

