jtm437 opened a new issue, #6687:
URL: https://github.com/apache/hudi/issues/6687

   Hello, I am seeing performance issues when upserting data into a 
Hudi COW table. With the specs below, the upsert takes longer than 4 hours to 
finish (if it finishes at all). As the screenshots below show, most of the 
time is spent in the index scan. I have tried disabling 
hoodie.bloom.index.prune.by.ranges because our record key is random. I have 
also tried upserting with the "Simple" index type and did not see any 
performance improvement. Is there anything else I can do to improve performance?
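
For reference, the two attempts described above correspond roughly to Hudi write options like the ones below. This is a minimal sketch, not the actual job configuration: the helper name `build_upsert_options` and the variable names are hypothetical, and the real job sets other options (record key, precombine field, table name) that are not shown in this issue.

```python
def build_upsert_options(index_type="BLOOM", prune_by_ranges=True):
    """Return a dict of Hudi datasource write options for a COW upsert.

    Hypothetical helper for illustration only; it covers just the settings
    mentioned in this issue.
    """
    return {
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.hive_sync.partition_fields": "year,month",
        "hoodie.index.type": index_type,
        # Hudi options are passed as strings, so the boolean becomes "true"/"false".
        "hoodie.bloom.index.prune.by.ranges": str(prune_by_ranges).lower(),
    }


# Attempt 1: default Bloom index with range pruning disabled (random record keys).
bloom_no_pruning = build_upsert_options("BLOOM", prune_by_ranges=False)

# Attempt 2: Simple index instead of Bloom.
simple_index = build_upsert_options("SIMPLE")
```

A dict like this would typically be handed to the Spark writer via `df.write.format("hudi").options(**bloom_no_pruning)`, followed by the save mode and the table path.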
   
   <img width="1674" alt="image" 
src="https://user-images.githubusercontent.com/38168120/190529250-5596f90f-ef5c-4523-aad1-675de90cb5b4.png">
   <img width="1678" alt="image" 
src="https://user-images.githubusercontent.com/38168120/190529310-bf33d109-3ef2-4232-8ef6-24260cca5d5f.png">
   
   **Specs:**
   Table Size: 13.6TB (compressed in S3)
   Number of partitions: 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
   Upsert dataset size: 390 million records, 21GB compressed
   Index type: Default (Bloom)
   Number of nodes: 30
   Node type: r6g.8xlarge
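
To put the specs above in proportion, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: sizes are taken as decimal units (1 TB = 10^12 bytes), and data is assumed to be spread evenly across partitions, which the issue does not state. Under those assumptions the table averages roughly 12 GB per partition and the upsert batch roughly 54 compressed bytes per record.

```python
# Figures copied from the specs in this issue.
TABLE_BYTES = 13.6e12         # 13.6 TB compressed (assumption: decimal TB)
PARTITIONS = 1135
UPSERT_RECORDS = 390_000_000
UPSERT_BYTES = 21e9           # 21 GB compressed (assumption: decimal GB)

# Average partition size, assuming an even spread across partitions.
avg_partition_gb = TABLE_BYTES / PARTITIONS / 1e9

# Average compressed size of one incoming record.
bytes_per_record = UPSERT_BYTES / UPSERT_RECORDS

print(f"average partition size: {avg_partition_gb:.1f} GB")
print(f"average incoming record size: {bytes_per_record:.1f} bytes")
```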
   
   **Environment Description**
   
   * Hudi version : 0.10
   * Spark version : 2.4.8
   * EMR version: 5.34.0
   * Hive version : 2.38.0
   * Hadoop version : Amazon 2.10.1
   * Storage (HDFS/S3/GCS..) : S3
   
   

