yihua commented on issue #5481:
URL: https://github.com/apache/hudi/issues/5481#issuecomment-1116452731

   @MikeBuh Thanks for the clarification.  What is the input size of your batch 
reload?  The same principle can be applied here for calculating the 
parallelism.  To be conservative at first, you can derive the parallelism as 
[input size / 100MB].  For example, if the input size is 50GB, you can 
try a parallelism of 500 as below:
   ```
   spark.default.parallelism: 500
   spark.sql.shuffle.partitions: 500
   hoodie.upsert.shuffle.parallelism: 500
   ```
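   The rule of thumb above (parallelism ≈ input size / 100MB) can be sketched as a small helper.  The function name and the exact byte units are assumptions for illustration:
   ```python
   import math

   # Target input size per task (~100 MB, decimal units), per the
   # conservative rule of thumb above.
   TARGET_BYTES_PER_TASK = 100 * 10**6

   def suggest_parallelism(input_size_bytes: int) -> int:
       """Derive a conservative shuffle parallelism from the batch input size."""
       return max(1, math.ceil(input_size_bytes / TARGET_BYTES_PER_TASK))

   print(suggest_parallelism(50 * 10**9))  # 50 GB -> 500
   ```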
   You should also increase the Kryo serializer buffer:
   ```
   spark.kryoserializer.buffer.max: 1024m
   ```
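   Taken together, the Spark-side settings could be applied on a `spark-submit` invocation like the sketch below.  The class name and jar are placeholders, not from this issue; `hoodie.upsert.shuffle.parallelism` is usually passed as a write option on the Hudi `DataFrameWriter` rather than a Spark conf:
   ```shell
   # Placeholder class and jar names; adjust to your job.
   # spark.kryoserializer.buffer.max only takes effect when the
   # Kryo serializer is enabled.
   spark-submit \
     --class org.example.HudiBatchReload \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.default.parallelism=500 \
     --conf spark.sql.shuffle.partitions=500 \
     --conf spark.kryoserializer.buffer.max=1024m \
     hudi-batch-reload.jar
   # In the job itself, set the Hudi write option, e.g.:
   #   .option("hoodie.upsert.shuffle.parallelism", "500")
   ```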
   The executor and driver memory settings should be fine, as they are 
already large enough.
   
   Once you get a successful run, you can gradually reduce the parallelism to 
find the sweet spot, so that each task leverages more memory without hitting OOM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
