ziudu commented on issue #11565: URL: https://github.com/apache/hudi/issues/11565#issuecomment-2209885176
trans_code is a randomly generated UUID string, e.g. cb1d7307e4e047989955ca544e175c71. Table tb_transaction_detail has 96,500,000 records with 96,500,000 distinct trans_code values, i.e. every record has a unique trans_code.

We found a workaround: with "hoodie.index.bucket.engine": "CONSISTENT_HASHING" there is no more data skew or spill during the write stage. Consistent hashing is slower, though: the simple bucket index took 17-18 minutes to join and write (even with the data skew), while consistent hashing took 24 minutes to join and write the same data.

The parallelism during the write stage after the join is 320 (spark.sql.shuffle.partitions) for the simple bucket index, and 800 (the number of parquet files in the resulting table) for the consistent hashing bucket index.
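As a minimal sketch of the workaround above: the options below switch a Hudi bucket-index write from the simple engine to consistent hashing. The bucket count, operation, and table type are illustrative assumptions (consistent hashing is only supported on MERGE_ON_READ tables), not the reporter's exact job config; trans_code is used as the record/hash key per the description above.

```python
# Hudi write options for the consistent-hashing bucket index workaround.
# Values marked "illustrative" are assumptions, not the original job's config.
hudi_options = {
    "hoodie.table.name": "tb_transaction_detail",
    "hoodie.datasource.write.recordkey.field": "trans_code",
    "hoodie.datasource.write.operation": "upsert",          # illustrative
    # Consistent hashing requires a MERGE_ON_READ table type.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "BUCKET",
    # The workaround: CONSISTENT_HASHING instead of the default SIMPLE engine.
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",
    "hoodie.bucket.index.hash.field": "trans_code",
    "hoodie.bucket.index.num.buckets": "800",               # illustrative
}

# Applied to a Spark DataFrame write, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

The trade-off reported above follows from how the two engines place records: the simple engine hashes keys into a fixed bucket count, so skewed key distributions can overload a few write tasks, while consistent hashing can split hot buckets at the cost of extra write-path overhead.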
