MikeBuh commented on issue #5481: URL: https://github.com/apache/hudi/issues/5481#issuecomment-1116990374
Hi @yihua, the batches I am trying to load are around 9 GB each. For the latest test I tried to load only 2 of these batches, but not even one of them was processed successfully. I made 2 runs and both failed with an executor running out of memory. Both runs used the same Spark resources; the only difference was the parallelism (both for Spark and Hudi):

**Common Spark Parameters**

> spark.driver.cores: 5
> spark.driver.memory: 24100m
> spark.driver.memoryOverhead: 2680m
>
> spark.executor.instances: 10
> spark.executor.cores: 5
> spark.executor.memory: 24100m
> spark.executor.memoryOverhead: 2680m
> spark.memory.storageFraction: 0.6
> spark.memory.fraction: 0.7
>
> spark.kryoserializer.buffer.max: 1024m
>
> spark.driver.extraJavaOptions: -Xloggc:/var/log/spark-GClog.log -XX:+PrintGC -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
>
> spark.executor.extraJavaOptions: -Xloggc:/var/log/spark-GClog.log -XX:+PrintGC -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'

**Run 1 Parallelism**

> spark.default.parallelism: 100
> spark.sql.shuffle.partitions: 100
> hoodie.upsert.shuffle.parallelism: 100

**Run 2 Parallelism**

> spark.default.parallelism: 250
> spark.sql.shuffle.partitions: 250
> hoodie.upsert.shuffle.parallelism: 250

Given the above and the persisting failures, might any of the following affect the performance and/or have anything to do with the required resources?
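To make the comparison concrete: the two runs share every setting except three parallelism knobs. A minimal sketch collecting them as plain config dictionaries (the `parallelism_conf` helper is mine for illustration; actually applying these via `spark-submit --conf` or a `SparkSession` builder is implied, not shown):

```python
# Settings shared by both runs (copied from the comment above).
COMMON_CONF = {
    "spark.executor.instances": "10",
    "spark.executor.cores": "5",
    "spark.executor.memory": "24100m",
    "spark.executor.memoryOverhead": "2680m",
    "spark.memory.fraction": "0.7",
    "spark.memory.storageFraction": "0.6",
}

def parallelism_conf(n: int) -> dict:
    """Build the three parallelism settings that varied between runs."""
    return {
        "spark.default.parallelism": str(n),
        "spark.sql.shuffle.partitions": str(n),
        "hoodie.upsert.shuffle.parallelism": str(n),
    }

run1 = {**COMMON_CONF, **parallelism_conf(100)}  # Run 1
run2 = {**COMMON_CONF, **parallelism_conf(250)}  # Run 2
```

Both runs failed the same way, which suggests raising shuffle parallelism alone did not change the per-task memory pressure enough.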
- size of the target table: I noticed that reloading the same batches into a near-empty table is more successful
- file sizes: maybe having fewer but larger files in the target table can help when comparing and updating
- compaction and cleanup: if these are heavy operations that need lots of memory, then perhaps they can be tweaked

Thanks once again for your reply
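The three points above map onto standard Hudi write configs. A hedged sketch of the knobs one might try (the config keys are standard Hudi write options; the values are illustrative starting points only, not validated against this workload):

```python
# Illustrative Hudi write options touching the three areas above.
# Keys are standard Hudi write configs; values are example starting
# points, not recommendations verified for this table.

file_sizing_opts = {
    # Fewer but larger base files: raise the target parquet file size
    # and the small-file limit so upserts pack into larger files.
    "hoodie.parquet.max.file.size": str(256 * 1024 * 1024),    # 256 MB
    "hoodie.parquet.small.file.limit": str(128 * 1024 * 1024), # 128 MB
}

memory_opts = {
    # Fraction of executor memory the spillable merge map may use
    # before spilling to disk during upsert merge / compaction.
    "hoodie.memory.merge.fraction": "0.6",
    "hoodie.memory.compaction.fraction": "0.6",
}

cleaner_opts = {
    # Retain fewer commits so the cleaner has less work per run.
    "hoodie.cleaner.commits.retained": "10",
}

# These would be passed as .options(**opts) on the DataFrame writer.
```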
