ChiehFu opened a new issue #4682: URL: https://github.com/apache/hudi/issues/4682
**Describe the problem you faced**

Recently, we upgraded our testing environment from Hudi 0.8.0 to Hudi 0.10.0, and after the upgrade we noticed that upsert jobs for some of our existing tables run much slower than they did on Hudi 0.8.0.

For our Hudi tables, we run one bulk_insert job to ingest the initial snapshot, and then schedule an upsert job every 10 minutes to ingest incremental updates after the bulk_insert job completes.

To reproduce the issue, we ran an upsert job on a table of roughly 1.8 TB. The job took in 11 tsv files (< 150 MB in total) containing both new records and updates. On Hudi 0.8.0 the job took 8.5 minutes to complete, whereas on Hudi 0.10.0 it took 19 minutes. The main difference appears to come from the "Getting small files from partitions" stage.

0.8.0
<img width="3004" alt="0_8_0" src="https://user-images.githubusercontent.com/11819388/150885241-e10fb7ae-4ae4-4a17-bb46-480ffef63cef.png">

0.10.0
<img width="2794" alt="0_10_0" src="https://user-images.githubusercontent.com/11819388/150885263-532d1163-65bd-4a9c-a217-e1742e74bffc.png">

We also ran the same upsert job against a fresh table with no pre-existing snapshot or incremental data, and the job took around 8 minutes on both 0.8.0 and 0.10.0. Based on this result, we suspect that in Hudi 0.10.0 upsert performance degrades as more upsert jobs complete and the table grows, whereas in Hudi 0.8.0 we did not observe this kind of degradation.

**Environment Description**

- Hudi version : 0.10.0
- Spark version : 2.4.7
- Hive version : 2.3.7
- Hadoop version : 2.10.1
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : no
- AWS EMR : 5.33.0, 1 master (r6g.16xlarge) with 20 core nodes (r6g.16xlarge)

**Additional context**

Spark configs:
```
--deploy-mode cluster
--executor-memory 43g
--driver-memory 43g
--executor-cores 6
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.hive.convertMetastoreParquet=false
--conf spark.hadoop.fs.s3.maxRetries=30
--conf spark.yarn.executor.memoryOverhead=5g
```

Hudi configs:
```
hoodie.consistency.check.enabled -> true
hoodie.datasource.write.table.type -> "COPY_ON_WRITE"
hoodie.datasource.write.keygenerator.class -> "org.apache.hudi.keygen.ComplexKeyGenerator"
hoodie.upsert.shuffle.parallelism -> 1500
hoodie.parquet.max.file.size -> 500 * 1024 * 1024
hoodie.datasource.write.operation -> "upsert"
hoodie.metadata.enable -> true
hoodie.metadata.validate -> true
hoodie.fail.on.timeline.archiving -> false
hoodie.clean.automatic -> true
hoodie.cleaner.commits.retained -> 72
hoodie.keep.min.commits -> 100
hoodie.keep.max.commits -> 150
```

Please let me know if you need any more information, thanks.
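For reference, below is a minimal Scala sketch of how an upsert job with the configs above could be wired through the Spark DataSource writer. The input path, base path, table name, record key, partition path, and precombine fields are hypothetical placeholders, not the actual values from our pipeline:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-upsert-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

// Hypothetical incremental input: tab-separated files on S3
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3://example-bucket/incoming/*.tsv")

df.write
  .format("hudi")
  .option("hoodie.table.name", "example_table")                        // hypothetical table name
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
  .option("hoodie.datasource.write.recordkey.field", "id")             // hypothetical record key field
  .option("hoodie.datasource.write.partitionpath.field", "dt")         // hypothetical partition field
  .option("hoodie.datasource.write.precombine.field", "ts")            // hypothetical precombine field
  .option("hoodie.upsert.shuffle.parallelism", "1500")
  .option("hoodie.parquet.max.file.size", (500L * 1024 * 1024).toString)
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.validate", "true")
  .option("hoodie.consistency.check.enabled", "true")
  .option("hoodie.fail.on.timeline.archiving", "false")
  .option("hoodie.clean.automatic", "true")
  .option("hoodie.cleaner.commits.retained", "72")
  .option("hoodie.keep.min.commits", "100")
  .option("hoodie.keep.max.commits", "150")
  .mode(SaveMode.Append)
  .save("s3://example-bucket/hudi/example_table")                      // hypothetical base path
```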
