Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-920561496
Hi @nsivabalan,
We have been trying to optimize the upsert, but a 44 GB upsert over a 54 GB bulk-insert on a fairly big cluster is still taking more than 3 hours. Below are the EMR cluster configuration and the upsert config:
```scala
userSegDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString())
  .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP, 50)
  .option(HoodieMetadataConfig.METADATA_ENABLE_PROP, true)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, true)
  .option(HoodieWriteConfig.UPSERT_PARALLELISM, 200)
  .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT_PROP, false)
  .option(HoodieWriteConfig.WRITE_BUFFER_LIMIT_BYTES, 41943040)
  .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, true)
  .mode(SaveMode.Append)
  .save(s"$basePath/$tableName/")
```
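For easier cross-referencing with the Hudi configuration docs, here is the same write config restated with raw string keys. This is a reference sketch only, assuming Hudi 0.8.x key names; the constant-based version above is what we actually run.

```scala
// Reference sketch (not part of the job): the raw string keys behind the
// constants above, assuming Hudi 0.8.x naming. tableName, key, partitionKey,
// combineKey and keyGenClass are the same vals used in the snippet above.
val hudiOptions = Map(
  "hoodie.table.name"                               -> tableName,
  "hoodie.datasource.write.table.type"              -> "COPY_ON_WRITE",
  "hoodie.datasource.write.operation"               -> "upsert",
  "hoodie.datasource.write.recordkey.field"         -> key,
  "hoodie.datasource.write.partitionpath.field"     -> partitionKey,
  "hoodie.datasource.write.precombine.field"        -> combineKey,
  "hoodie.datasource.write.keygenerator.class"      -> keyGenClass,
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.row.writer.enable"       -> "true", // only takes effect for bulk_insert
  "hoodie.index.type"                               -> "SIMPLE",
  "hoodie.simple.index.parallelism"                 -> "50",
  "hoodie.metadata.enable"                          -> "true",
  "hoodie.upsert.shuffle.parallelism"               -> "200",
  "hoodie.combine.before.upsert"                    -> "false",
  "hoodie.write.buffer.limit.bytes"                 -> "41943040",
  "hoodie.copyonwrite.record.size.estimate"         -> "100"
)
// Equivalent write: userSegDf.write.format("hudi").options(hudiOptions)
//   .mode(SaveMode.Append).save(s"$basePath/$tableName/")
```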
Cluster config:
Static EMR cluster: 1 master node (m5.xlarge) and 8 core nodes (r5d.24xlarge).
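For context on the sizing (assuming the standard r5d.24xlarge spec of 96 vCPUs and 768 GiB per node): the core fleet has 8 × 96 = 768 vCPUs and 8 × 768 GiB ≈ 6 TiB of memory. The submit command below requests 192 executors × 4 cores = 768 cores, i.e. every vCPU, and 192 × (20 GiB heap + 4 GiB overhead) = 4608 GiB, roughly 75% of cluster memory.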
Spark-submit command:
```bash
spark-submit --master yarn --deploy-mode client \
  --num-executors 192 --driver-memory 4G --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  --conf spark.yarn.max.executor.failures=100 \
  --conf spark.task.cpus=1 \
  --conf spark.rdd.compress=true \
  --conf spark.kryoserializer.buffer.max=512m \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.executor.cores=4 \
  --conf spark.segment.etl.numexecutors=192 \
  --conf spark.network.timeout=800 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.task.maxFailures=4 \
  --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
  --conf spark.segment.processor.partition.count=1536 \
  --conf spark.segment.processor.output-shard.count=60 \
  --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.hadoop.fs.s3.maxRetries=20 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=3000 \
  --class <class-name> \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://<application>.jar
```
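(The `spark.segment.*` entries are our application's own settings passed through spark-submit; they are not Spark or Hudi configs.)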