[GitHub] [hudi] Rap70r opened a new issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

GitBox Thu, 02 Dec 2021 10:59:14 -0800


Rap70r opened a new issue #3697:
URL: https://github.com/apache/hudi/issues/3697



   Hello,
   We are using Spark and Hudi on EMR to upsert records, extracted from Kafka, 
into Parquet files in S3. The events can be either inserts or updates.
   Currently, the process takes 41 minutes to extract and upsert 
1,430,000 records (1,714 MB).
   We are trying to speed this process up. Below are the details 
of our environment:
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   
   * EMR version : 6.4.0
       > Master Instance: 1 r5.xlarge
       > Core Instance: 1 c5.xlarge
       > Task Instance: 25 c5.xlarge
   
   * Spark version : 3.1.2
   
   * Hive version : n/a
   
   * Hadoop version : 3.2.1
   
   * Source : Kafka
   
   * Storage : S3 (as parquet)
   
   * Partitions: 1100
   
   * Partition Size: ~1MB to 30MB each
   
   * Parallelism: 3000
   
   * Operation: Upsert
   
   * Key: Concatenation of a few fields
   
   * Partition : Concatenation of year, month and week of a date field
   
   * Storage Type: COPY_ON_WRITE
   
   * Running on Docker? : no
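
For readers mapping the list above onto Hudi writer settings, here is a minimal sketch of the write options it would correspond to. The table name and the field names (`record_key`, `year_month_week`, `updated_at`) are hypothetical placeholders, the precombine field is not stated in this report, and treating "Parallelism: 3000" as `hoodie.upsert.shuffle.parallelism` is an assumption.

```python
def build_hudi_upsert_options(table_name, record_key_field,
                              partition_path_field, precombine_field,
                              parallelism=3000):
    """Return Hudi datasource write options for a COPY_ON_WRITE upsert.

    All field names passed in are placeholders; the report only says the
    key and partition are concatenations of other fields.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_path_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Assumption: the "Parallelism: 3000" entry refers to this setting.
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
    }


options = build_hudi_upsert_options(
    table_name="events",                     # hypothetical
    record_key_field="record_key",           # hypothetical concatenated key column
    partition_path_field="year_month_week",  # hypothetical concatenated partition column
    precombine_field="updated_at",           # hypothetical, not given in the report
)

# With a Spark DataFrame `df`, the options would typically be applied as:
# df.write.format("hudi").options(**options).mode("append").save("s3://<bucket>/<path>")
```
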
   
   **Spark-Submit Configs**
   `spark-submit --deploy-mode cluster --conf 
spark.dynamicAllocation.enabled=true --conf 
spark.dynamicAllocation.cachedExecutorIdleTimeout=300s --conf 
spark.dynamicAllocation.executorIdleTimeout=300s --conf 
spark.scheduler.mode=FAIR --conf spark.memory.fraction=0.4 --conf 
spark.memory.storageFraction=0.1 --conf spark.shuffle.service.enabled=true 
--conf spark.sql.hive.convertMetastoreParquet=false --conf 
spark.sql.parquet.mergeSchema=true --conf spark.driver.maxResultSize=4g --conf 
spark.driver.memory=4g --conf spark.executor.cores=4 --conf 
spark.driver.memoryOverhead=1g --conf spark.executor.instances=100 --conf 
spark.executor.memoryOverhead=1g --conf spark.driver.cores=4 --conf 
spark.executor.memory=4g --conf spark.rdd.compress=true --conf 
spark.kryoserializer.buffer.max=512m --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.yarn.nodemanager.vmem-check-enabled=false --conf 
yarn.nodemanager.pmem-check-enabled=false --conf 
spark.sql.shuffle.partitions=100 --conf spark.default.parallelism=100 --conf 
spark.task.cpus=2`
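
As a rough sanity check, the settings above can be compared with the cluster and workload figures from this report. The sketch below is back-of-the-envelope arithmetic only. It assumes each c5.xlarge worker has 4 vCPUs and therefore hosts at most one 4-core executor, that all 26 core and task nodes are available to executors, and that "Parallelism: 3000" is the number of tasks in the upsert stages. What YARN actually allocates may differ.

```python
import math

# Figures taken from the report.
records = 1_430_000
megabytes = 1714
minutes = 41
executor_cores = 4   # spark.executor.cores
task_cpus = 2        # spark.task.cpus
parallelism = 3000   # assumed to be the task count of the upsert stages

# Assumptions, not stated in the report.
worker_nodes = 1 + 25      # core + task instances
executors_per_node = 1     # one 4-core executor per 4-vCPU c5.xlarge

seconds = minutes * 60
records_per_second = records / seconds
mb_per_second = megabytes / seconds

# Each task reserves task_cpus cores, so an executor runs this many at once.
slots_per_executor = executor_cores // task_cpus
concurrent_tasks = worker_nodes * executors_per_node * slots_per_executor

# Number of scheduling "waves" needed to run every task of a stage.
waves = math.ceil(parallelism / concurrent_tasks)

print(f"{records_per_second:.0f} records/s, {mb_per_second:.2f} MB/s")
print(f"{slots_per_executor} task slots per executor, "
      f"{concurrent_tasks} concurrent tasks, {waves} waves per stage")
```

Under these assumptions, `spark.task.cpus=2` halves the number of tasks each executor can run at once, and a 3000-task stage needs many scheduling waves on a cluster of this size.
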
   
   **Spark Job**
   
![image](https://user-images.githubusercontent.com/22181358/134231023-4aa94788-5f68-4610-843c-1e98187aa810.png)
   
   From the job above, most of the time appears to be spent in the 
UpsertPartitioner and SparkUpsertCommitActionExecutor stages.
   
   Do you have any suggestions on how to reduce the time the job above takes 
to complete?
   
   Thank you
   
   


