[GitHub] [hudi] tverdokhlebd opened a new issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

GitBox Tue, 27 Apr 2021 17:06:34 -0700


tverdokhlebd opened a new issue #1491:
URL: https://github.com/apache/hudi/issues/1491



   Hello. I run the job with the following configuration:
   
   docker run --rm \
   -v /var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi:/var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi \
   -v /mnt/ml_data:/mnt/ml_data bde2020/spark-master:2.4.5-hadoop2.7 \
   bash ./spark/bin/spark-submit \
   --master 'local[2]' \
   --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4 \
   --conf spark.local.dir=/mnt/ml_data \
   --conf spark.ui.enabled=false \
   --conf spark.driver.memory=4g \
   --conf spark.driver.memoryOverhead=1024 \
   --conf spark.driver.maxResultSize=2g \
   --conf spark.kryoserializer.buffer.max=512m \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.rdd.compress=true \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.sql.hive.convertMetastoreParquet=false \
   --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
   --conf spark.executorEnv.hudi.tableName=ext_ml_data \
   --conf spark.executorEnv.hudi.recordKey=tds_cid \
   --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
   --conf spark.executorEnv.hudi.parallelism=8 \
   --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
   --class mtu.spark.analytics.ExtMLDataToS3
   
   The job performs two steps:
   
   1. Bulk insert of 53M records from Vertica into S3 storage;
   2. Upsert of the same 53M records from Vertica into the same S3 location.
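
   For reference, the two steps correspond roughly to the Hudi datasource write options below. This is a minimal sketch and not the actual code of ExtMLDataToS3, which is not shown here. The helper name hudi_write_options is hypothetical, the values are taken from the spark.executorEnv.hudi.* settings above, and it is an assumption that the job passes its parallelism setting through to Hudi's shuffle parallelism options. Other options the job may set (for example a partition path field) are omitted.

```python
def hudi_write_options(operation, table_name, record_key, precombine_key, parallelism):
    """Build Hudi datasource write options for one write step (hypothetical helper)."""
    # Each write operation has its own shuffle parallelism option in Hudi.
    parallelism_keys = {
        "bulk_insert": "hoodie.bulkinsert.shuffle.parallelism",
        "insert": "hoodie.insert.shuffle.parallelism",
        "upsert": "hoodie.upsert.shuffle.parallelism",
    }
    if operation not in parallelism_keys:
        raise ValueError("unsupported operation: " + operation)
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
        parallelism_keys[operation]: str(parallelism),
    }


# Step 1: bulk insert; step 2: upsert of the same records into the same table.
bulk_insert_opts = hudi_write_options("bulk_insert", "ext_ml_data", "tds_cid", "hit_timestamp", 8)
upsert_opts = hudi_write_options("upsert", "ext_ml_data", "tds_cid", "hit_timestamp", 8)
```

   With PySpark, such a dict would typically be applied as df.write.format("org.apache.hudi").options(**upsert_opts).mode("append").save("s3a://ir-mtu-ml-bucket/ml_hudi"), where df is the DataFrame loaded from Vertica.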
   
   I received different errors over three attempts:
   
   - java.util.concurrent.TimeoutException: Cannot receive any reply from c3dcd5b0c2ab:41498 in 10000 milliseconds
   - java.lang.OutOfMemoryError: Java heap space
   - java.lang.OutOfMemoryError: GC overhead limit exceeded
   
   I have read the "Tuning guide" and tried adjusting several settings: memory fractions (decreased from their default values), off-heap parameters, the garbage collector, parallelism, etc. I also increased driver memory from 4GB to 10GB, but that did not help either.
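
   To put the numbers in the configuration in perspective, here is a rough back-of-the-envelope sketch. It assumes records are spread evenly across partitions and that the hudi.parallelism value of 8 ends up as the shuffle parallelism of the upsert; neither is confirmed. With --master 'local[2]' the tasks run inside the driver JVM, so spark.driver.memory is the heap shared by everything. The target of 500,000 records per partition is a hypothetical figure for illustration, not a recommendation.

```python
import math

TOTAL_RECORDS = 53_000_000  # records in the bulk insert and in the upsert
PARALLELISM = 8             # spark.executorEnv.hudi.parallelism
CONCURRENT_TASKS = 2        # --master 'local[2]'


def records_per_partition(total_records, parallelism):
    """Average number of records per partition, assuming an even spread."""
    return math.ceil(total_records / parallelism)


def partitions_for_target(total_records, target_per_partition):
    """Number of partitions needed to stay at or below a per-partition target."""
    return math.ceil(total_records / target_per_partition)


per_partition = records_per_partition(TOTAL_RECORDS, PARALLELISM)
# Records belonging to the partitions being processed at the same time.
# This is a count of records in progress, not a measurement of heap usage.
in_progress = per_partition * CONCURRENT_TASKS

print(per_partition)                                   # 6625000
print(in_progress)                                     # 13250000
print(partitions_for_target(TOTAL_RECORDS, 500_000))   # 106
```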
   
   Stack trace for the heap space error: https://drive.google.com/open?id=1-Kerkt2j-z_0zXal01rha55j1NzsxYto

