nsivabalan commented on issue #3605: URL: https://github.com/apache/hudi/issues/3605#issuecomment-937477763
I went over your latest messages. I guess you interchanged the upsert and bulk_insert commands while posting above; never mind. Let me comment on each command.

1. I see that we have added a lot of custom options with spark-submit. When I did benchmarking, 100 GB could get bulk_inserted in 1 to 2 mins for simple record keys and partition paths, so something strange is definitely going on. Can we remove all the custom options and try a simple command? Does your executor actually have 48G of memory? Just confirming. I have tried to trim a few configs, but let's keep a minimal set so that once we get a good perf run, we can add these configs back and see which one is causing the spike in runtime.

```
spark-submit --master yarn --deploy-mode client \
  --num-executors 100 --driver-memory 12G --executor-memory 48G \
  --conf spark.yarn.executor.memoryOverhead=8192 \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  --conf spark.shuffle.io.numConnectionsPerPeer=3 \
  --conf spark.shuffle.file.buffer=512k \
  --conf spark.memory.fraction=0.7 \
  --conf spark.memory.storageFraction=0.5 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.connection.maximum=2000 \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.connection.establish.timeout=500 \
  --conf spark.hadoop.fs.s3a.connection.timeout=5000 \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --conf spark.hadoop.com.amazonaws.services.s3.enableV4=true \
  --conf spark.hadoop.com.amazonaws.services.s3.enforceV4=true \
  --conf spark.driver.cores=4 \
  --conf spark.executor.cores=3 \
  --conf spark.yarn.driver.memoryOverhead=8192 \
  --conf spark.yarn.max.executor.failures=100 \
  --conf spark.rdd.compress=true \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.network.timeout=800 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.task.maxFailures=4 \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.hadoop.fs.s3.maxRetries=2 \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.kryo.registrationRequired=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=1536 \
  --class <class-name> \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  <jar-file-name>.jar
```

For example, when I did bulk_insert benchmarking, I used the command below with spark-shell. (Note: the original command passed the `-XX:` GC flags via `spark.driver.extraClassPath`/`spark.executor.extraClassPath`, where they have no effect; they belong in `extraJavaOptions`, corrected below.)

```
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.kryoserializer.buffer.max=1024m' \
  --driver-memory 8g --executor-memory 9g \
  --master yarn --deploy-mode client \
  --num-executors 15 --executor-cores 8 \
  --conf spark.rdd.compress=true \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.ui.proxyBase="" \
  --conf "spark.memory.storageFraction=0.8" \
  --conf "spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf 'spark.executor.memoryOverhead=2000m'
```

Nothing fancy; I just set appropriate memory, cores, and some GC tuning configs, and things worked for me.

- bulk_insert configs: let's increase the index parallelism to 1000 and remove the storage-level configs. I mean, let's try to get a baseline first, and then iteratively we can add back more configs. I see you are setting the parquet max file size in your upsert command.
We probably need to set it for bulk_insert too.

- upsert configs: again, let's set the index parallelism to 1000 and remove the storage-level configs.
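As a rough sketch of how the two suggestions above could be expressed as Hudi datasource options in PySpark: the option keys below are standard Hudi write configs, but the table name, record key, partition path field, and the 1000/120 MB values are illustrative assumptions (and `hoodie.bloom.index.parallelism` assumes a bloom index is in use), so adjust them to your table.

```python
def hudi_write_opts(operation: str, table_name: str = "my_table") -> dict:
    """Build a minimal Hudi option map reflecting the tuning suggested above.

    `operation` is "bulk_insert" or "upsert"; everything else is an example.
    """
    opts = {
        "hoodie.table.name": table_name,                     # hypothetical table name
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "id",     # example record key
        "hoodie.datasource.write.partitionpath.field": "dt", # example partition path
        # Suggested tuning: bump index parallelism to 1000 (assumes bloom index).
        "hoodie.bloom.index.parallelism": "1000",
        # Set the parquet max file size for bulk_insert as well as upsert.
        "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    }
    # The shuffle-parallelism knob differs per operation.
    if operation == "bulk_insert":
        opts["hoodie.bulkinsert.shuffle.parallelism"] = "1000"
    else:
        opts["hoodie.upsert.shuffle.parallelism"] = "1000"
    return opts


# With a SparkSession and a DataFrame `df`, the write itself would look like:
#   df.write.format("hudi").options(**hudi_write_opts("bulk_insert")) \
#     .mode("append").save("s3a://<bucket>/<path>")
```

Get the baseline with only this minimal set, then layer the other configs back one at a time to find the one that regresses the run.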
