tverdokhlebd opened a new issue #1491: URL: https://github.com/apache/hudi/issues/1491
Hello. I run the job with the following configuration:

    docker run --rm \
      -v /var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi:/var/lib/jenkins-slave/workspace/Transfer_ml_data_to_s3_hudi \
      -v /mnt/ml_data:/mnt/ml_data \
      bde2020/spark-master:2.4.5-hadoop2.7 \
      bash ./spark/bin/spark-submit \
      --master 'local[2]' \
      --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.hadoop:hadoop-aws:2.7.3,org.apache.spark:spark-avro_2.11:2.4.4 \
      --conf spark.local.dir=/mnt/ml_data \
      --conf spark.ui.enabled=false \
      --conf spark.driver.memory=4g \
      --conf spark.driver.memoryOverhead=1024 \
      --conf spark.driver.maxResultSize=2g \
      --conf spark.kryoserializer.buffer.max=512m \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.rdd.compress=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.sql.hive.convertMetastoreParquet=false \
      --conf spark.executorEnv.hudi.outputPath=s3a://ir-mtu-ml-bucket/ml_hudi \
      --conf spark.executorEnv.hudi.tableName=ext_ml_data \
      --conf spark.executorEnv.hudi.recordKey=tds_cid \
      --conf spark.executorEnv.hudi.precombineKey=hit_timestamp \
      --conf spark.executorEnv.hudi.parallelism=8 \
      --conf spark.executorEnv.hudi.bulkInsertParallelism=8 \
      --class mtu.spark.analytics.ExtMLDataToS3

What I do:

1. Bulk insert from Vertica to S3 storage with 53M records.
2. Upsert from Vertica to the same S3 storage with the same 53M records.

Across three attempts I received different errors:

- java.util.concurrent.TimeoutException: Cannot receive any reply from c3dcd5b0c2ab:41498 in 10000 milliseconds
- java.lang.OutOfMemoryError: Java heap space
- java.lang.OutOfMemoryError: GC overhead limit exceeded

I have read the "Tuning Guide" and tried several of its suggestions: memory fractions (decreased from the default values), off-heap parameters, the garbage collector, parallelism, etc. I also increased driver memory from 4 GB to 10 GB, but that did not help either.
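
For reference, the two write steps above can be sketched as Hudi write options. This is only an illustration in Python, not the code of `mtu.spark.analytics.ExtMLDataToS3`: the helper `hudi_write_options` is hypothetical, and it assumes the job turns the `spark.executorEnv.hudi.*` values from the command into the standard `hoodie.*` options shown here.

```python
def hudi_write_options(table_name, record_key, precombine_key, operation, parallelism):
    """Build a Hudi DataFrame-writer option map (hypothetical helper, for illustration).

    The hoodie.* keys are standard Hudi write configs; how the real job maps
    its own settings onto them is an assumption.
    """
    if operation not in ("bulk_insert", "upsert"):
        raise ValueError("unsupported operation: %s" % operation)

    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_key,
        "hoodie.datasource.write.operation": operation,
    }

    # Each operation reads its own shuffle-parallelism setting.
    if operation == "bulk_insert":
        options["hoodie.bulkinsert.shuffle.parallelism"] = str(parallelism)
    else:
        options["hoodie.upsert.shuffle.parallelism"] = str(parallelism)
    return options


# Step 1: bulk insert, using the values passed on the command line above.
bulk_opts = hudi_write_options("ext_ml_data", "tds_cid", "hit_timestamp", "bulk_insert", 8)

# Step 2: upsert of the same records into the same table.
upsert_opts = hudi_write_options("ext_ml_data", "tds_cid", "hit_timestamp", "upsert", 8)

for name, opts in (("bulk_insert", bulk_opts), ("upsert", upsert_opts)):
    print(name)
    for key in sorted(opts):
        print("  %s=%s" % (key, opts[key]))
```

In a PySpark job, a map like this would typically be passed as `df.write.format("org.apache.hudi").options(**opts)` and saved to the `s3a://ir-mtu-ml-bucket/ml_hudi` path. With 53M records and a parallelism of 8, each shuffle partition carries several million records on average.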

Stack trace for the heap space error: https://drive.google.com/open?id=1-Kerkt2j-z_0zXal01rha55j1NzsxYto
