usberkeley opened a new pull request, #12170: URL: https://github.com/apache/hudi/pull/12170
### Change Logs

Hudi's base file and log file data are sorted by primary key as follows:

1. During the delta commit phase, the data is sorted.
2. During the compaction phase, the sorted data is merged using merge sort.

Delta Commit: The batch of data being written is sorted by primary key.

Compaction: [Compaction with Merge Sort Implementation link](https://docs.google.com/presentation/d/10hHkQsd0bCdCor4w9Wduds5l-i_Z7X7A-zk_vYYi4kA/edit#slide=id.p13)

### Impact

Sorting the base file and log file data in Hudi offers the following benefits:

1. **Enhancing Hudi Reader Performance**: Once the data is sorted, the reader can combine base and log records with merge sort, avoiding the large memory usage or disk IO overhead associated with Map-based methods.
2. **Improving Compaction Performance**: Compaction can likewise merge sorted data with merge sort, reducing the large memory usage and disk IO overhead required by Map-based methods.
3. **Supporting MDT Introduction of Primary Key Index**: Sorted data facilitates the introduction of a sparse index similar to the one in ClickHouse, thus improving the query efficiency of Hudi Reader.

Performance Comparison After Introducing **Ordered Hudi Data** (Including Compaction with Merge Sort and Log Compaction with Merge Sort), **Primary Key Index** and **Secondary Index**: [Performance Comparison link](https://docs.google.com/presentation/d/10hHkQsd0bCdCor4w9Wduds5l-i_Z7X7A-zk_vYYi4kA/edit#slide=id.p19)

### Risk level (write none, low medium or high below)

medium

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
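
### Illustrative sketch (not part of this PR's code)

To make the merge-sort idea from the Change Logs and Impact sections concrete, here is a minimal sketch of merging a key-sorted base file with key-sorted log files in a single streaming pass, without loading records into a Map. It is not Hudi's implementation. The record layout `(key, commit_time, value)` and the function name `merge_sorted_files` are hypothetical and chosen only for illustration. The sketch assumes each input is already sorted by primary key and contains at most one record per key, and it ignores deletes.

```python
import heapq
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical record layout for illustration: (primary_key, commit_time, value).
Record = Tuple[str, int, str]


def merge_sorted_files(base: Iterable[Record],
                       logs: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Stream-merge key-sorted inputs; for duplicate keys the newest commit wins.

    Memory use is proportional to the number of inputs, not the number of
    records, because only the head of each input is held at any time.
    """
    streams = [base, *logs]
    # heapq.merge lazily performs a k-way merge of already-sorted iterables.
    merged = heapq.merge(*streams, key=lambda r: (r[0], r[1]))

    current: Optional[Record] = None
    for rec in merged:
        if current is not None and rec[0] != current[0]:
            # Key changed: the held record is the newest version of its key.
            yield current
        # Within one key, records arrive in commit_time order, so the
        # latest one seen replaces any older version.
        current = rec
    if current is not None:
        yield current


base_file = [("a", 1, "a0"), ("c", 1, "c0"), ("d", 1, "d0")]
log_files = [
    [("a", 2, "a1"), ("b", 2, "b1")],
    [("b", 3, "b2"), ("d", 3, "d2")],
]
compacted = list(merge_sorted_files(base_file, log_files))
```

Because the output is emitted in key order, the compacted file stays sorted, which is the property the later merge passes and a sparse primary key index would rely on.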
