MikeBuh commented on issue #5481: URL: https://github.com/apache/hudi/issues/5481#issuecomment-1115792891
@yihua Thank you for the reply. However, please note the following:

- Real-time ingestion is working fine; I only included those details to showcase what is currently used for new data.
- The batch reload is the job having issues: it fails with out-of-memory errors (error code 137).
- The Spark UI screenshots and detailed configs relate to this reload job.
- Perhaps I was not clear: the resources used by the reload job are the default EMR ones specified in that config (re-adding below for full clarity).

**Spark Parameters**

> spark.driver.cores: 5
> spark.driver.memory: 24100m
> spark.driver.memoryOverhead: 2680m
>
> spark.executor.instances: 10
> spark.executor.cores: 5
> spark.executor.memory: 24100m
> spark.executor.memoryOverhead: 2680m
> spark.memory.storageFraction: 0.6
> spark.memory.fraction: 0.7
> spark.default.parallelism: 100
> spark.sql.shuffle.partitions: 100
> spark.kryoserializer.buffer.max: 128m
>
> spark.driver.extraJavaOptions: -Xloggc:/var/log/spark-GClog.log -XX:+PrintGC -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'
> spark.executor.extraJavaOptions: -Xloggc:/var/log/spark-GClog.log -XX:+PrintGC -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'

**Hudi Parameters**

> hoodie.index.type: BLOOM
> hoodie.datasource.write.operation: UPSERT
> hoodie.upsert.shuffle.parallelism: 100
> hoodie.payload.ordering.field: hoodie.datasource.write.precombine.field
> hoodie.datasource.write.payload.class: org.apache.hudi.common.model.DefaultHoodieRecordPayload

Thanks once again for your prompt reply. I hope you can assist me with this.

--
This is an automated message from the Apache Git Service.
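For reference, the memory budget implied by the Spark parameters above can be worked out with a short Python sketch. This is only rough arithmetic, not a diagnosis of the failure. It assumes Spark's unified memory model, where a fixed 300 MiB of heap is reserved before `spark.memory.fraction` is applied, and it assumes the full configured heap is available to the JVM. The helper names `memory_budget` and `signal_from_exit_code` are hypothetical, introduced only for illustration.

```python
# Values copied from the Spark parameters in the comment (sizes in MiB).
EXECUTOR_MEMORY_MIB = 24100
EXECUTOR_OVERHEAD_MIB = 2680
EXECUTOR_CORES = 5
MEMORY_FRACTION = 0.7    # spark.memory.fraction
STORAGE_FRACTION = 0.6   # spark.memory.storageFraction

# Spark reserves a fixed 300 MiB of heap before applying spark.memory.fraction.
RESERVED_MIB = 300


def memory_budget(heap_mib, overhead_mib, cores, memory_fraction, storage_fraction):
    """Approximate per-executor memory regions (MiB) under unified memory."""
    # Size of the container requested from the resource manager.
    container = heap_mib + overhead_mib
    # Heap shared by execution and storage.
    unified = (heap_mib - RESERVED_MIB) * memory_fraction
    # Portion of the unified region where cached blocks are immune to eviction.
    storage = unified * storage_fraction
    # Execution memory guaranteed even when storage is full; execution can
    # borrow more from the storage region whenever that region is unused.
    execution = unified - storage
    return {
        "container": container,
        "unified": unified,
        "storage": storage,
        "execution": execution,
        "execution_per_task": execution / cores,
    }


def signal_from_exit_code(exit_code):
    """Return the signal number encoded in an exit code above 128, else None."""
    return exit_code - 128 if exit_code > 128 else None


if __name__ == "__main__":
    budget = memory_budget(
        EXECUTOR_MEMORY_MIB,
        EXECUTOR_OVERHEAD_MIB,
        EXECUTOR_CORES,
        MEMORY_FRACTION,
        STORAGE_FRACTION,
    )
    for name, mib in budget.items():
        print(f"{name}: {mib:.0f} MiB")
    # 137 = 128 + 9, i.e. the process was terminated by signal 9 (SIGKILL).
    print("signal for exit code 137:", signal_from_exit_code(137))
```

Under these assumptions each executor asks for a container of about 26780 MiB, and the guaranteed execution share per concurrent task is a little over 1.3 GiB. Exit code 137 corresponds to signal 9, which is consistent with either an external kill of the container or the `-XX:OnOutOfMemoryError='kill -9 %p'` option listed above firing. The sketch cannot tell which of the two happened.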
