ziudu commented on issue #11565: URL: https://github.com/apache/hudi/issues/11565#issuecomment-2209885176
trans_code is a randomly generated UUID string, e.g. cb1d7307e4e047989955ca544e175c71. Table tb_transaction_detail has 96,500,000 records with 96,500,000 distinct trans_code values, i.e. every record has a unique trans_code.

We found a workaround: with "hoodie.index.bucket.engine": "CONSISTENT_HASHING" there is no more data skew or spill during the write stage. Consistent hashing is slower, though: the simple bucket index took 17-18 minutes to join and write (even with the data skew), while consistent hashing took 24 minutes to join and write the same data.

The parallelism during the write stage after the join is 320 (spark.sql.shuffle.partitions) for the simple bucket index, and 800 (the number of parquet files in the resulting table) for the consistent hashing bucket index.
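As a minimal sketch of the workaround above: the options below switch a Hudi bucket-index write from the simple engine to consistent hashing. The bucket count, operation, and table type are illustrative assumptions (consistent hashing is only supported on MERGE_ON_READ tables), not the reporter's exact job config; trans_code is used as the record/hash key per the description above.

```python
# Hudi write options for the consistent-hashing bucket index workaround.
# Values marked "illustrative" are assumptions, not the original job's config.
hudi_options = {
    "hoodie.table.name": "tb_transaction_detail",
    "hoodie.datasource.write.recordkey.field": "trans_code",
    "hoodie.datasource.write.operation": "upsert",          # illustrative
    # Consistent hashing requires a MERGE_ON_READ table type.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "BUCKET",
    # The workaround: CONSISTENT_HASHING instead of the default SIMPLE engine.
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",
    "hoodie.bucket.index.hash.field": "trans_code",
    "hoodie.bucket.index.num.buckets": "800",               # illustrative
}

# Applied to a Spark DataFrame write, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The trade-off reported above follows from how the two engines place records: the simple engine hashes keys into a fixed bucket count, so skewed key distributions can overload a few write tasks, while consistent hashing can split hot buckets at the cost of extra write-path overhead.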
