ChiehFu opened a new issue, #10121:
URL: https://github.com/apache/hudi/issues/10121

   Hello, 
   
Recently we migrated our datasets from Hudi 0.8 to Hudi 0.12.3 and started 
experiencing slowness in the writing stage, where parquet files are written to 
S3.
   
   The numbers below were observed on a COW table that is 12 GB in size and has 
10 partitions, with parquet file sizes roughly between 30 MB and 300 MB.
   
   In an upsert job of 27,679 records with a total size of 26.8 MB, we observed 
that each task in the writing stage took up to 10 minutes to write a parquet 
file of 30 MB to 300 MB. Individual task duration appears directly correlated 
with the size of the parquet file the task wrote, which makes sense; however, 
spending 10 minutes writing a 300 MB parquet file to S3 seems 
extremely long.
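   For scale, the worst case above works out to an effective throughput well 
under 1 MB/s (a quick back-of-the-envelope calculation using the numbers from 
this issue):

   ```python
   # Back-of-the-envelope check of the write throughput described above.
   file_size_mb = 300      # largest parquet file observed
   duration_s = 10 * 60    # up to 10 minutes per write task

   throughput_mb_s = file_size_mb / duration_s
   print(f"effective throughput: {throughput_mb_s:.1f} MB/s")  # 0.5 MB/s
   ```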
   
   Can you please help us understand what might be causing such slowness in 
the writing stage, and whether there is a way to improve performance here?
   
   Complete spark job:
   <img width="3008" alt="Screenshot 2023-11-16 at 10 15 03 AM" 
src="https://github.com/apache/hudi/assets/11819388/4b6a17ab-50de-4c9b-9658-d29e6d72e823">
   
   Writing stage:
   <img width="3008" alt="Screenshot 2023-11-16 at 10 15 21 AM" 
src="https://github.com/apache/hudi/assets/11819388/c6cb1822-df18-42bf-b2cc-685f3f1df41e">
   <img width="3002" alt="Screenshot 2023-11-16 at 10 15 37 AM" 
src="https://github.com/apache/hudi/assets/11819388/41159777-22ba-4d1b-aa56-54b07c565b4a">
   
   Hudi commit metadata for the upsert job:
   <img width="1650" alt="image" 
src="https://github.com/apache/hudi/assets/11819388/b9f04d13-fff4-4da4-8f8e-341b253e0e17">
   <img width="1150" alt="image" 
src="https://github.com/apache/hudi/assets/11819388/e7713eea-7231-4d8e-a9d1-46239b591096">
   
   
   Environment Description
   
   Hudi version : 0.12.3
   
   Spark version : 3.1.3
   
   Hive version : 3.1.3
   
   Hadoop version : 3.3.3
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : no
   
   EMR: 6.10.0/6.10.1
   
   Additional context
   
   Hudi configs
   
   hoodie.metadata.enable: true
   hoodie.metadata.validate: true
   hoodie.cleaner.commits.retained: 72
   hoodie.keep.min.commits: 100
   hoodie.keep.max.commits: 150
   hoodie.datasource.write.payload.class: org.apache.hudi.common.model.DefaultHoodieRecordPayload
   hoodie.index.type: BLOOM
   hoodie.bloom.index.parallelism: 2000
   hoodie.datasource.write.table.type: COPY_ON_WRITE
   hoodie.insert.shuffle.parallelism: 500
   hoodie.datasource.write.operation: upsert
   hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.ComplexKeyGenerator
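
   For reference, a minimal PySpark sketch of how we pass the options above to 
the Hudi datasource (the table name and S3 path here are placeholders, not our 
actual values):

   ```python
   # Hudi write options as used in this issue; table name and output path
   # below are hypothetical placeholders.
   hudi_options = {
       "hoodie.table.name": "my_table",  # placeholder
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
       "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.index.type": "BLOOM",
       "hoodie.bloom.index.parallelism": "2000",
       "hoodie.metadata.enable": "true",
       "hoodie.metadata.validate": "true",
       "hoodie.cleaner.commits.retained": "72",
       "hoodie.keep.min.commits": "100",
       "hoodie.keep.max.commits": "150",
       "hoodie.insert.shuffle.parallelism": "500",
   }

   # Applied to a DataFrame `df` in the job (path is a placeholder):
   # df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/path")
   ```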


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
