aleizu opened a new issue, #9322: URL: https://github.com/apache/hudi/issues/9322
**Describe the problem you faced**

When writing files into S3 after batch streaming from Kafka, the "Tagging" step takes around 2 hours to finish while the EMR cluster looks almost idle. Only two executors appear to be doing all the tasks (I don't know whether this could be part of the problem). This is running on AWS EMR with this setup:

> MASTER: 1 x r5.8xlarge
> CORE: 15 x r5.8xlarge

It does not look like a memory problem.

### spark-config

```
"maximizeResourceAllocation": "true",
"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
"spark.driver.maxResultSize": "0",
"spark.sql.streaming.minBatchesToRetain": "360",
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"spark.yarn.maxAppAttempts": "1",
"spark.sql.optimizer.enableJsonExpressionOptimization": "false",
"spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
"spark.sql.adaptive.enabled": "true",
"spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
"spark.sql.adaptive.coalescePartitions.enabled": "true",
"spark.cleaner.referenceTracking.cleanCheckpoints": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.sql.adaptive.skewJoin.enabled": "true"
```

### HUDI .write Options

These are the options I'm using for the Hudi write:

```scala
.write
.format("hudi")
.option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
.option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
.option("hoodie.datasource.write.recordkey.field", "<unique-field>,<timestamp-field>,<hash-field>")
.option("hoodie.datasource.write.partitionpath.field", "<text-field-partition-input>,<text-field-partition2-input>,<Text with Date CCYYMM>")
.option("hoodie.datasource.write.precombine.field", "<timestamp-field>")
.option("hoodie.table.name", <hudiTableName>)
.option("hoodie.datasource.write.hive_style_partitioning", "true")
.option("hoodie.metadata.enable", "true")
.option("hoodie.metadata.insert.parallelism", "6")
.option("hoodie.clean.async", "true")
.option("hoodie.clean.automatic", "true")
.option("hoodie.cleaner.policy", "KEEP_LATEST_BY_HOURS")
.option("hoodie.cleaner.hours.retained", "168")
.option("hoodie.datasource.write.operation", "upsert")
.option("hoodie.metrics.on", "true")
.option("hoodie.metrics.reporter.type", "CLOUDWATCH")
.option("hoodie.metrics.cloudwatch.metric.prefix", "xxx_")
.option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
.option("hoodie.cleaner.policy.failed.writes", "LAZY")
.option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.InProcessLockProvider")
.mode("append")
.save(<outputDirectory>)
```

I've marked with "<.....>" the values I've manually replaced for privacy reasons.

### Versions

* Hudi version : 0.11.0 (hudi-spark3.2-bundle_2.12:0.11.0)
* Spark version : 3.2.1
* Hadoop version : 3.2.1 (org.apache.hadoop:hadoop-aws:3.2.1)
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

Thank you for any insights, and let me know if you require any extra information 🙂
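
**Additional context**

Since "Tagging" is Hudi's index lookup (mapping incoming records to existing file groups), I plan to experiment with the index knobs below. This is only a sketch based on the Hudi 0.11 config reference; the parallelism value is a placeholder I'd tune per workload, not something I've validated:

```scala
// Appended to the .write options above — a sketch, not a verified fix.
.option("hoodie.index.type", "BLOOM")                 // default for COPY_ON_WRITE; "Tagging" is the bloom-filter lookup
.option("hoodie.bloom.index.parallelism", "1500")     // assumption: force higher shuffle parallelism for the lookup (0 = auto)
.option("hoodie.bloom.index.prune.by.ranges", "true") // skip files whose key ranges cannot contain the incoming keys
```

Since my record key includes a hash field, the keys are effectively random, so range pruning may not help much; in that case switching `hoodie.index.type` to `SIMPLE` is sometimes suggested for update-heavy tables.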

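On the executor side, I also wonder whether `maximizeResourceAllocation` combined with dynamic allocation is what leaves only two (very large) executors active during the tagging stage. A hypothetical alternative is pinning a fixed fleet of smaller executors; every number below is an illustrative assumption for 15 x r5.8xlarge core nodes (32 vCPU / 256 GiB each), not a recommendation:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sizing sketch — all values are assumptions to be tuned, not measured results.
val spark = SparkSession.builder()
  .appName("hudi-kafka-upsert")
  .config("spark.dynamicAllocation.enabled", "false") // fixed fleet, so a low-parallelism stage can't shrink it
  .config("spark.executor.instances", "90")           // ~6 executors per core node
  .config("spark.executor.cores", "5")                // 6 x 5 = 30 of 32 vCPUs per node
  .config("spark.executor.memory", "36g")             // leaves headroom for YARN overhead
  .config("spark.default.parallelism", "900")         // ~2x total executor cores
  .config("spark.sql.shuffle.partitions", "900")
  .getOrCreate()
```

Even with a fixed fleet, the tagging stage will only spread out if it produces enough tasks, which is why I'd pair this with the index parallelism options above.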