gudladona commented on PR #18241: URL: https://github.com/apache/hudi/pull/18241#issuecomment-3961432830
> > After this fix, this mode of clustering succeeds and meets (exceeds) the intended performance of this feature.
>
> nice contribution~ @gudladona, look forward to more details about the perf gains.

@danny0405 As you can see below from a test clustering job that has been running for a week, you can see the point where we switched to this clustering strategy. There are caveats to this binary/stream-copy-based approach: it creates new files with many row groups of uneven sizes. Although it reduces file count, it can significantly increase the number of file splits. That can help parallelism somewhat, but it also has several disadvantages.

The way we plan to use this is to run a minor compaction using row-based clustering, which deserializes/decompresses records and writes new files via the Parquet writer up to a max file size of 64-128 MB, and then let the binary copy take over in a major compaction that just stitches row groups together, which is very quick. I would appreciate some guidance on whether this approach has any quirks.

<img width="2393" height="236" alt="image" src="https://github.com/user-attachments/assets/6316cf04-808e-4263-bb93-189857e294a9" />
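To make the two-phase plan above concrete, here is a minimal sketch (not Hudi code; all names are illustrative) of the size-binning step the minor compaction would need: small files are grouped into bins capped at the 64-128 MB target, each bin destined for a row-based rewrite, after which the major compaction would only binary-stitch row groups from the resulting files.

```python
MB = 1024 * 1024
MINOR_TARGET = 128 * MB  # assumed max output size for the row-based (minor) pass


def plan_minor_bins(file_sizes, target=MINOR_TARGET):
    """Group input file sizes into bins, each at most ``target`` bytes.

    Each bin would be rewritten row-by-row (decode/re-encode) into one
    well-sized file; the later binary (major) pass then only stitches
    row groups, avoiding any deserialization. Hypothetical helper, not
    a Hudi API.
    """
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


sizes = [10 * MB, 20 * MB, 60 * MB, 50 * MB, 90 * MB]
bins = plan_minor_bins(sizes)
# Every bin respects the minor-compaction size cap.
assert all(sum(b) <= MINOR_TARGET for b in bins)
```

This is just the planning half; the stitching half maps to Parquet's row-group append facilities (e.g. `ParquetFileWriter.appendFile` in parquet-mr), which copy compressed row groups without decoding them.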
