yihua commented on issue #5481: URL: https://github.com/apache/hudi/issues/5481#issuecomment-1116452731
@MikeBuh Thanks for the clarification. What is the input size of your batch reload? The same principle can be applied here for calculating the parallelism. To be conservative at first, derive the parallelism as `input size / 100MB`. For example, if the input size is 50GB, you can try a parallelism of 500:

```
spark.default.parallelism: 500
spark.sql.shuffle.partitions: 500
hoodie.upsert.shuffle.parallelism: 500
```

You should also increase the Kryo serializer buffer:

```
spark.kryoserializer.buffer.max: 1024m
```

The executor and driver memory should be fine, since they are already large enough. Once you get a successful run, you can reduce the parallelism further to find the sweet spot, so that more memory is leveraged without hitting OOM.
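As a minimal sketch of the sizing rule above (the 100 MB target partition size is the conservative starting point from this thread, not a Hudi default, and the helper name is hypothetical):

```python
def upsert_parallelism(input_size_gb: float, target_partition_mb: int = 100) -> int:
    """Derive a conservative shuffle parallelism as input size / target partition size."""
    # Convert GB to MB, divide by the per-partition target, and never go below 1.
    return max(1, round(input_size_gb * 1024 / target_partition_mb))

# The example from the comment: a 50 GB batch reload.
print(upsert_parallelism(50))  # 512, i.e. roughly the suggested 500
```

Tuning down later (e.g. raising `target_partition_mb` once a run succeeds) gives each task more data and makes better use of executor memory, at the cost of a higher OOM risk.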
