ChiehFu opened a new issue, #10121: URL: https://github.com/apache/hudi/issues/10121
Hello,

We recently migrated our datasets from Hudi 0.8 to Hudi 0.12.3 and started experiencing slowness in the writing stage, where parquet files are written to S3. The numbers below were observed on a COW table 12 GB in size with 10 partitions and parquet file sizes roughly between 30 MB and 300 MB. In an upsert job of 27,679 records with a total size of 26.8 MB, each task in the writing stage took up to 10 minutes to write a parquet file of 30 MB to 300 MB. Individual task duration appears directly correlated with the size of the parquet file the task wrote, which makes sense; however, spending 10 minutes writing a 300 MB parquet file to S3 seems extremely long. Can you please help us understand what might be causing this slowness in the writing stage, and whether there is a way to improve the performance here?

Complete spark job:
<img width="3008" alt="Screenshot 2023-11-16 at 10 15 03 AM" src="https://github.com/apache/hudi/assets/11819388/4b6a17ab-50de-4c9b-9658-d29e6d72e823">

Writing stage:
<img width="3008" alt="Screenshot 2023-11-16 at 10 15 21 AM" src="https://github.com/apache/hudi/assets/11819388/c6cb1822-df18-42bf-b2cc-685f3f1df41e">
<img width="3002" alt="Screenshot 2023-11-16 at 10 15 37 AM" src="https://github.com/apache/hudi/assets/11819388/41159777-22ba-4d1b-aa56-54b07c565b4a">

Hudi commit metadata for the upsert job:
<img width="1650" alt="image" src="https://github.com/apache/hudi/assets/11819388/b9f04d13-fff4-4da4-8f8e-341b253e0e17">
<img width="1150" alt="image" src="https://github.com/apache/hudi/assets/11819388/e7713eea-7231-4d8e-a9d1-46239b591096">

Environment Description

Hudi version : 0.12.3
Spark version : 3.1.3
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
EMR: 6.10.0/6.10.1

Additional context

Hudi configs:

hoodie.metadata.enable: true
hoodie.metadata.validate: true
hoodie.cleaner.commits.retained: 72
hoodie.keep.min.commits: 100
hoodie.keep.max.commits: 150
hoodie.datasource.write.payload.class: org.apache.hudi.common.model.DefaultHoodieRecordPayload
hoodie.index.type: BLOOM
hoodie.bloom.index.parallelism: 2000
hoodie.datasource.write.table.type: COPY_ON_WRITE
hoodie.insert.shuffle.parallelism: 500
hoodie.datasource.write.operation: upsert
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.ComplexKeyGenerator
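For reference, the configs above are typically passed as write options on a PySpark upsert. A minimal sketch (the table name, target S3 path, and the `upsert` helper are illustrative placeholders, not taken from the job in this issue):

```python
# The Hudi write options reported in this issue, as a PySpark options dict.
# "hoodie.table.name" is a hypothetical placeholder; the rest mirror the
# configs listed above.
hudi_options = {
    "hoodie.table.name": "my_table",  # placeholder, not from the issue
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.index.type": "BLOOM",
    "hoodie.bloom.index.parallelism": "2000",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.validate": "true",
    "hoodie.cleaner.commits.retained": "72",
    "hoodie.keep.min.commits": "100",
    "hoodie.keep.max.commits": "150",
    "hoodie.insert.shuffle.parallelism": "500",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}


def upsert(df, target_path):
    """Upsert `df` into the Hudi table at `target_path`.

    Assumes a Spark session with the Hudi bundle on the classpath;
    this function is a sketch and is not executed here.
    """
    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save(target_path))
```

With a COW table, each upsert rewrites every parquet file that contains an updated record in full, which is why even a small 26.8 MB batch touching large files produces large file writes.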
