Ambarish-Giri opened a new issue #3605:
URL: https://github.com/apache/hudi/issues/3605


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   Hi Team,
   I have been testing Hudi for inserts/updates/deletes on data in S3. Below are the benchmark metrics captured so far across varied data sizes:
   
   Run 1 - Fresh Insert
   -----------------------
   Total Data size = 7 GB
   
   
   COW = 22 mins
   MOR = 25 mins
   
   
   
   Run 2 - Upsert
   --------------------
   Total Data Size=6.7 GB
   
   COW = 61 mins
   MOR = 64 mins
   
   
   Run 3 - Upsert
   -------------------
   Total Data size:  2.5 GB
   
   COW = 39 mins
   MOR = 53 mins
   
   Below are cluster configurations used:
   EMR Version : 5.33.0
   Hudi: 0.7.0
   Spark: 2.4.7
   Scala: 2.11.12
   Static cluster with 1 Master (m5.xlarge) node, 4 * (m5.2xlarge) core nodes and 4 * (m5.2xlarge) task nodes
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Execute a Hudi insert/upsert on text data stored in S3
   2. The spark-submit is issued on EMR 5.33.0
   3. Hudi 0.7.0 and Scala 2.11.12 are used
   
   **Expected behavior**
   
   I was not expecting Hudi to take this much time to write to the Hudi store. My expectation was that both inserts and upserts should take 15-20 minutes at most for data of this size (7-8 GB). Also, even for writes the CoW strategy was outperforming MoR, which I expected to be the other way around.
   
   **Environment Description**
   
   * Hudi version : 0.7.0
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.7
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   This is a complete batch job, we receive daily loads and upserts are 
supposed to be performed over existing Hudi Tables.
   
   Static EMR cluster: 1 Master (m5.xlarge) node, 4 * (m5.2xlarge) core nodes and 4 * (m5.2xlarge) task nodes
   Spark submit command:
   spark-submit --master yarn --num-executors 8 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=5 \
        --conf spark.segment.etl.numexecutors=8 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=500 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=500 \
        --conf spark.kryo.registrationRequired=false \
        --class <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
        s3://<jar-name>
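As a quick sanity check of the executor sizing above: assuming m5.2xlarge nodes (8 vCPU, 32 GiB RAM) and EMR's default YARN allocation of roughly 24 GiB per m5.2xlarge node (both figures are assumptions worth verifying against the EMR docs, not values from this issue), the memory math works out as follows:

```scala
// Hypothetical sanity check of the executor memory math for this cluster.
// Assumptions (not from the issue): m5.2xlarge ~ 8 vCPU / 32 GiB, and
// EMR's default YARN container memory on m5.2xlarge ~ 24 GiB per node.
object ExecutorSizing {
  val executorMemoryGiB = 20     // --executor-memory 20G
  val overheadGiB = 4            // spark.yarn.executor.memoryOverhead=4096 (MB)
  val yarnMemoryPerNodeGiB = 24  // assumed EMR default for m5.2xlarge

  // Total YARN memory one executor requests.
  val perExecutorGiB: Int = executorMemoryGiB + overheadGiB

  // How many such executors YARN can place on one node.
  val executorsPerNode: Int = yarnMemoryPerNodeGiB / perExecutorGiB

  def main(args: Array[String]): Unit = {
    println(s"Each executor requests $perExecutorGiB GiB; " +
      s"a node fits $executorsPerNode executor(s).")
    // With 8 core+task nodes, at most 8 executors fit, matching --num-executors 8,
    // so the job runs one 5-core executor per 8-vCPU node.
  }
}
```

Under these assumptions the cluster is fully subscribed on memory (24 GiB requested vs. ~24 GiB available per node), so the slowness is unlikely to come from under-allocation of executors.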
   
   Hudi insert and upsert parameters:
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
           if (hudiWriteStrg == "MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL
           else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
         .option("hoodie.upsert.shuffle.parallelism", "2")
         .mode(SaveMode.Overwrite)
         .save(s"$basePath/$tableName/")
   
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
           if (hudiWriteStrg == "MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL
           else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .mode(SaveMode.Append)
         .save(s"$basePath/$tableName/")
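Note that the insert path above pins `hoodie.upsert.shuffle.parallelism` to 2, which caps Hudi's write parallelism regardless of cluster size and is a likely contributor to the slow runs. A common rule of thumb (an assumption here, not an official Hudi recommendation) is to size `hoodie.insert.shuffle.parallelism` / `hoodie.upsert.shuffle.parallelism` to a small multiple of the total executor cores; a minimal sketch of that arithmetic:

```scala
// Hypothetical helper: derive a Hudi shuffle parallelism from cluster shape.
// The 2x-total-cores factor is a rule-of-thumb assumption, not a documented value.
object HudiParallelismSketch {
  def suggestedParallelism(numExecutors: Int, coresPerExecutor: Int, factor: Int = 2): Int =
    numExecutors * coresPerExecutor * factor

  def main(args: Array[String]): Unit = {
    // Benchmark cluster: --num-executors 8, spark.executor.cores=5
    val p = suggestedParallelism(numExecutors = 8, coresPerExecutor = 5)
    println(p) // 80, versus the hard-coded "2" in the snippet above
    // The result would feed hoodie.insert.shuffle.parallelism and
    // hoodie.upsert.shuffle.parallelism in the writer options.
  }
}
```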
   
   
   I have tried running a full production load of 53 GB on the production cluster, with the cluster configuration and spark-submit command below, for a Hudi insert using the COW write strategy. I observed that it takes more than 2 hours just for the insert, and from the earlier runs it is evident that the upsert operation will take even longer.
   
   Total Data size: 53 GB
   Cluster size: 1 Master (m5.xlarge) node, 2 * (r5a.24xlarge) core nodes and 6 * (r5a.24xlarge) task nodes
   Spark submit command:
   spark-submit --master yarn --num-executors 192 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=4 \
        --conf spark.segment.etl.numexecutors=192 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=1536 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=1536 \
        --conf spark.kryo.registrationRequired=false \
        --class <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
        s3://<jar-name>
    
   The Hudi insert and upsert parameters are the same as above.
