Ambarish-Giri opened a new issue #3605: URL: https://github.com/apache/hudi/issues/3605
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

Hi Team,

I was testing Hudi for doing inserts/updates/deletes on data in S3. Below are the benchmark metrics captured so far on varied data sizes:

Run 1 - Fresh Insert
--------------------
Total data size = 7 GB
COW = 22 mins
MOR = 25 mins

Run 2 - Upsert
--------------
Total data size = 6.7 GB
COW = 61 mins
MOR = 64 mins

Run 3 - Upsert
--------------
Total data size = 2.5 GB
COW = 39 mins
MOR = 53 mins

Below are the cluster configurations used:

- EMR version: 5.33.0
- Hudi: 0.7.0
- Spark: 2.4.7
- Scala: 2.11.12
- Static cluster with 1 master (m5.xlarge) node, 4 core (m5.2xlarge) nodes, and 4 task (m5.2xlarge) nodes

**To Reproduce**

Steps to reproduce the behavior:

1. Execute a Hudi insert/upsert on text data stored in S3.
2. The spark-submit is issued on EMR 5.33.0.
3. Hudi 0.7.0 and Scala 2.11.12 are used.

**Expected behavior**

Not expecting Hudi to take this much time to write to the Hudi store. The expectation was that it should take 15-20 mins at most for data of size 7-8 GB, for both inserts and upserts. Also, even for writes the CoW write strategy was performing better than MoR, which I thought would have been the other way around.

**Environment Description**

* Hudi version : 0.7.0
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

This is a complete batch job; we receive daily loads, and upserts are supposed to be performed over the existing Hudi tables.
Static EMR cluster: 1 master (m5.xlarge) node, 4 core (m5.2xlarge) nodes, and 4 task (m5.2xlarge) nodes.

Spark submit command:

```shell
spark-submit --master yarn --num-executors 8 --driver-memory 4G --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.executor.cores=5 \
  --conf spark.segment.etl.numexecutors=8 \
  --conf spark.network.timeout=800 \
  --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
  --conf spark.segment.processor.partition.count=500 \
  --conf spark.segment.processor.output-shard.count=60 \
  --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.hadoop.fs.s3.maxRetries=20 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=500 \
  --conf spark.kryo.registrationRequired=false \
  --class <class-name> \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://<jar-name>
```

Hudi insert and upsert parameters:

```scala
// Insert
userSegDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
    if (hudiWriteStrg == "MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL
    else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .mode(SaveMode.Overwrite)
  .save(s"$basePath/$tableName/")

// Upsert
userSegDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
    if (hudiWriteStrg == "MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL
    else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .mode(SaveMode.Append)
  .save(s"$basePath/$tableName/")
```

I have tried to run a full production load of 53 GB on the production cluster, with the below cluster configuration and spark-submit command, for a Hudi insert using the COW write strategy. I observed that it takes more than 2 hrs just for the insert, and it is quite evident from the earlier runs that it will take even more time for the upsert operation.

Total data size: 53 GB
Cluster size: 1 master (m5.xlarge) node, 2 core (r5a.24xlarge) nodes, and 6 task (r5a.24xlarge) nodes

Spark submit command:

```shell
spark-submit --master yarn --num-executors 192 --driver-memory 4G --executor-memory 20G \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.yarn.maxAppAttempts=3 \
  --conf spark.executor.cores=4 \
  --conf spark.segment.etl.numexecutors=192 \
  --conf spark.network.timeout=800 \
  --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
  --conf spark.segment.processor.partition.count=1536 \
  --conf spark.segment.processor.output-shard.count=60 \
  --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
  --conf spark.driver.maxResultSize=0 \
  --conf spark.hadoop.fs.s3.maxRetries=20 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=1536 \
  --conf spark.kryo.registrationRequired=false \
  --class <class-name> \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://<jar-name>
```

Hudi insert and upsert parameters are the same as above.
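For context on the write options above: the insert path pins `hoodie.upsert.shuffle.parallelism` to `2`, and Hudi also exposes sibling parallelism options for the insert and bulk-insert paths. A minimal sketch of passing them together as write options follows; this is a hedged illustration assuming the same `userSegDf` and variables as the snippets above, and the values shown are placeholders, not tuned recommendations:

```scala
// Illustrative only: Hudi's write-parallelism config keys collected in one map.
// The key names come from Hudi's write configs; the values are placeholders.
val parallelismOpts = Map(
  "hoodie.insert.shuffle.parallelism"     -> "200",
  "hoodie.upsert.shuffle.parallelism"     -> "200",
  "hoodie.bulkinsert.shuffle.parallelism" -> "200"
)

userSegDf.write
  .format("hudi")
  .options(parallelismOpts) // merged alongside the other options shown above
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .mode(SaveMode.Append)
  .save(s"$basePath/$tableName/")
```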
