gudladona commented on PR #18241: URL: https://github.com/apache/hudi/pull/18241#issuecomment-3961432830
> > After this fix, this mode of clustering succeeds and meets (exceeds) the intended performance of this feature.
>
> nice contribution~ @gudladona, look forward to more details about the perf gains.

@danny0405 As you can see below from a test clustering job that has been running for a week, you can see the point where we switched to this clustering strategy. There are caveats to this binary/stream-copy-based approach: it creates new files with many row groups of uneven sizes. Although it reduces file count, it can significantly increase the number of file splits. That can help parallelism somewhat, but it also has several disadvantages.

The way we plan to use this is to run a minor compaction using row-based clustering, which deserializes/decompresses records and writes new files via the Parquet writer up to a max file size of 64-128 MB, and then let the binary copy take over in a major compaction that just stitches row groups together, which is very quick. I would appreciate some guidance on whether this approach has any quirks.

<img width="2393" height="236" alt="image" src="https://github.com/user-attachments/assets/6316cf04-808e-4263-bb93-189857e294a9" />
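To make the two-phase plan above concrete, here is a minimal sketch (not Hudi code; all names are illustrative) of the size-binning step the minor compaction would need: small files are grouped into bins capped at the 64-128 MB target, each bin destined for a row-based rewrite, after which the major compaction would only binary-stitch row groups from the resulting files.

```python
MB = 1024 * 1024
MINOR_TARGET = 128 * MB  # assumed max output size for the row-based (minor) pass


def plan_minor_bins(file_sizes, target=MINOR_TARGET):
    """Group input file sizes into bins, each at most ``target`` bytes.

    Each bin would be rewritten row-by-row (decode/re-encode) into one
    well-sized file; the later binary (major) pass then only stitches
    row groups, avoiding any deserialization. Hypothetical helper, not
    a Hudi API.
    """
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


sizes = [10 * MB, 20 * MB, 60 * MB, 50 * MB, 90 * MB]
bins = plan_minor_bins(sizes)
# Every bin respects the minor-compaction size cap.
assert all(sum(b) <= MINOR_TARGET for b in bins)
```

This is just the planning half; the stitching half maps to Parquet's row-group append facilities (e.g. `ParquetFileWriter.appendFile` in parquet-mr), which copy compressed row groups without decoding them.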
