Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-920561496


   Hi @nsivabalan ,
   
   We have been trying to optimize the upsert, but a 44 GB upsert over a 54 GB bulk-insert on a fairly large cluster is still taking more than 3 hrs. Below are the EMR cluster configuration and the upsert config:
   
   userSegDf.write
     .format("hudi")
     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString())
     .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP, 50)
     .option(HoodieMetadataConfig.METADATA_ENABLE_PROP, true)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, true)
     .option(HoodieWriteConfig.UPSERT_PARALLELISM, 200)
     .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT_PROP, false)
     .option(HoodieWriteConfig.WRITE_BUFFER_LIMIT_BYTES, 41943040)
     .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
     .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, true)
     .mode(SaveMode.Append)
     .save(s"$basePath/$tableName/")
   
   Cluster config:
   Static EMR cluster: 1 master node (m5.xlarge) and 8 core nodes (r5d.24xlarge)
   
   Spark-submit command:
   
   spark-submit --master yarn --deploy-mode client \
        --num-executors 192 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.driver.memoryOverhead=2048 \
        --conf spark.yarn.max.executor.failures=100 \
        --conf spark.task.cpus=1 \
        --conf spark.rdd.compress=true \
        --conf spark.kryoserializer.buffer.max=512m \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=4 \
        --conf spark.segment.etl.numexecutors=192 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.sql.hive.convertMetastoreParquet=false \
        --conf spark.task.maxFailures=4 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=1536 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=3000 \
        --class <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
        s3://<application>.jar
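   As a quick sanity check of the sizing above (assuming, per AWS instance specs, that each r5d.24xlarge core node provides 96 vCPUs and 768 GiB of memory), the requested executors appear to fill the cluster's cores exactly while leaving memory headroom:
   
   # Hypothetical sizing check; node specs are assumptions, executor
   # settings are taken from the spark-submit command above.
   CORE_NODES=8; VCPUS_PER_NODE=96; MEM_PER_NODE_GB=768
   NUM_EXECUTORS=192; EXECUTOR_CORES=4; EXECUTOR_MEM_GB=20; OVERHEAD_GB=4
   
   CLUSTER_CORES=$((CORE_NODES * VCPUS_PER_NODE))                      # 768
   REQUESTED_CORES=$((NUM_EXECUTORS * EXECUTOR_CORES))                 # 768
   CLUSTER_MEM=$((CORE_NODES * MEM_PER_NODE_GB))                       # 6144
   REQUESTED_MEM=$((NUM_EXECUTORS * (EXECUTOR_MEM_GB + OVERHEAD_GB)))  # 4608
   
   echo "cores: $REQUESTED_CORES/$CLUSTER_CORES  mem(GiB): $REQUESTED_MEM/$CLUSTER_MEM"
   
   So the cluster does not look under-provisioned for this write; the bottleneck is more likely in the index lookup / upsert stages than in raw capacity.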

