nsivabalan commented on issue #3605: URL: https://github.com/apache/hudi/issues/3605#issuecomment-937477763
I went over your latest messages. I guess you interchanged the upsert and bulk_insert commands while posting above; never mind. Let me comment on each command.

1. I see that we have added a lot of custom options with spark-submit. When I did benchmarking, 100 GB could get bulk_inserted in 1 to 2 mins for simple record keys and partition paths, so something strange is definitely going on. Can we remove all the custom options and try a simple command? Does your executor actually have 48G of memory? Just confirming. I have tried to trim a few configs, but let's keep a minimal set so that once we get a good perf run, we can add these configs back and see which one is causing the spike in runtime.

```
spark-submit --master yarn --deploy-mode client \
  --num-executors 100 --driver-memory 12G --executor-memory 48G \
  --conf spark.yarn.executor.memoryOverhead=8192 \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
  --conf spark.shuffle.io.numConnectionsPerPeer=3 \
  --conf spark.shuffle.file.buffer=512k \
  --conf spark.memory.fraction=0.7 \
  --conf spark.memory.storageFraction=0.5 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.connection.maximum=2000 \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.hadoop.fs.s3a.connection.establish.timeout=500 \
  --conf spark.hadoop.fs.s3a.connection.timeout=5000 \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  --conf spark.hadoop.com.amazonaws.services.s3.enableV4=true \
  --conf spark.hadoop.com.amazonaws.services.s3.enforceV4=true \
  --conf spark.driver.cores=4 \
  --conf spark.executor.cores=3 \
  --conf spark.yarn.driver.memoryOverhead=8192 \
  --conf spark.yarn.max.executor.failures=100 \
  --conf spark.rdd.compress=true \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.network.timeout=800 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.task.maxFailures=4 \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.hadoop.fs.s3.maxRetries=2 \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.kryo.registrationRequired=false \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=1536 \
  --class <class-name> \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  <jar-file-name>.jar
```

For example, when I did bulk_insert benchmarking, I used the command below with spark-shell. (Note: the original command passed the `-XX:` GC flags via `spark.driver.extraClassPath`/`spark.executor.extraClassPath`, where they have no effect; they belong in `extraJavaOptions`, corrected below.)

```
./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.kryoserializer.buffer.max=1024m' \
  --driver-memory 8g --executor-memory 9g \
  --master yarn --deploy-mode client \
  --num-executors 15 --executor-cores 8 \
  --conf spark.rdd.compress=true \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.ui.proxyBase="" \
  --conf "spark.memory.storageFraction=0.8" \
  --conf "spark.driver.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" \
  --conf 'spark.executor.memoryOverhead=2000m'
```

Nothing fancy; I just set appropriate memory, cores, and some GC tuning configs, and things worked for me.

- bulk_insert configs: let's increase the index parallelism to 1000 and remove the storage-level configs. I mean, let's try to get a baseline first, and then iteratively we can add back more configs. I see you are setting the parquet max file size in your upsert command.
We probably need to set it for bulk_insert too.

- upsert configs: again, let's set the index parallelism to 1000 and remove the storage-level configs.
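As a rough sketch of how the two suggestions above could be expressed as Hudi datasource options in PySpark: the option keys below are standard Hudi write configs, but the table name, record key, partition path field, and the 1000/120 MB values are illustrative assumptions (and `hoodie.bloom.index.parallelism` assumes a bloom index is in use), so adjust them to your table.

```python
def hudi_write_opts(operation: str, table_name: str = "my_table") -> dict:
    """Build a minimal Hudi option map reflecting the tuning suggested above.

    `operation` is "bulk_insert" or "upsert"; everything else is an example.
    """
    opts = {
        "hoodie.table.name": table_name,                     # hypothetical table name
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": "id",     # example record key
        "hoodie.datasource.write.partitionpath.field": "dt", # example partition path
        # Suggested tuning: bump index parallelism to 1000 (assumes bloom index).
        "hoodie.bloom.index.parallelism": "1000",
        # Set the parquet max file size for bulk_insert as well as upsert.
        "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),
    }
    # The shuffle-parallelism knob differs per operation.
    if operation == "bulk_insert":
        opts["hoodie.bulkinsert.shuffle.parallelism"] = "1000"
    else:
        opts["hoodie.upsert.shuffle.parallelism"] = "1000"
    return opts


# With a SparkSession and a DataFrame `df`, the write itself would look like:
#   df.write.format("hudi").options(**hudi_write_opts("bulk_insert")) \
#     .mode("append").save("s3a://<bucket>/<path>")
```

Get the baseline with only this minimal set, then layer the other configs back one at a time to find the one that regresses the run.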
