yihua commented on PR #7723:
URL: https://github.com/apache/hudi/pull/7723#issuecomment-1400797484

   > The only structural change compared to current state is that we're not 
going to be overriding parallelism w/ default value of 200. If user specifies 
the config, it will still take precedence.
   > 
   > I was able to confirm in multiple benchmarks that avoiding setting 
parallelism w/ random value (200) brings considerable performance benefits:
   > 
   > 1. In the case of bulk-insert: we follow the natural partitioning of the dataset (i.e. we will have as many partitions as there are Parquet row-groups).
   > 2. In the case of upsert/insert: we may fall back to `spark.default.parallelism`, which is deduced dynamically based on the number of cores available to the cluster; this also seems superior to the existing state.
   
   @alexeykudinkin These are good scenarios to validate.  Could you also attach screenshots of the Spark UI here?
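   The precedence described in the quoted comment (an explicit user config wins; otherwise defer to Spark's dynamically derived default instead of a hard-coded 200) can be sketched as follows. This is a hypothetical illustration, not Hudi's actual code; the `resolve_parallelism` helper and the way the config dict is passed in are assumptions made for the example, though `hoodie.upsert.shuffle.parallelism` is a real Hudi config key.

   ```python
   # Hypothetical sketch of the parallelism precedence discussed above:
   # an explicitly set user config takes precedence; otherwise fall back
   # to spark.default.parallelism (derived from cluster cores) rather
   # than overriding with an arbitrary default of 200.

   def resolve_parallelism(user_config: dict, default_parallelism: int) -> int:
       """Return shuffle parallelism: a user-set value takes precedence."""
       user_value = user_config.get("hoodie.upsert.shuffle.parallelism")
       if user_value is not None:
           return int(user_value)
       # No user override: defer to Spark's dynamically derived default
       return default_parallelism

   # With no user config, the cluster-derived default (say, 16) is used
   print(resolve_parallelism({}, 16))   # → 16
   # An explicit user setting still takes precedence
   print(resolve_parallelism({"hoodie.upsert.shuffle.parallelism": "64"}, 16))   # → 64
   ```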
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
