usberkeley opened a new pull request, #12170:
URL: https://github.com/apache/hudi/pull/12170

   ### Change Logs
   
   Hudi's base file and log file data is sorted by primary key in two phases:
   
   1. Delta commit: each batch of written data is sorted by primary key.
   2. Compaction: the sorted data is merged using merge sort.
   
   Compaction details: [Compaction with Merge Sort Implementation 
link](https://docs.google.com/presentation/d/10hHkQsd0bCdCor4w9Wduds5l-i_Z7X7A-zk_vYYi4kA/edit#slide=id.p13)
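
The two phases can be illustrated with a minimal Java sketch. It is not the code in this PR: `SortedMergeSketch`, `sortBatch`, and `merge` are hypothetical names, records are modeled as plain key/value entries, and each input is assumed to hold at most one record per key. It only shows why sorted inputs allow a single streaming pass instead of loading one side into a Map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these are not Hudi classes or APIs.
class SortedMergeSketch {

    // Delta commit phase: sort the incoming batch by record key before writing.
    static List<Map.Entry<String, String>> sortBatch(List<Map.Entry<String, String>> batch) {
        List<Map.Entry<String, String>> sorted = new ArrayList<>(batch);
        sorted.sort(Map.Entry.comparingByKey());
        return sorted;
    }

    // Compaction phase: merge two key-sorted inputs in one pass.
    // Assumption: on equal keys the log record replaces the base record.
    static List<Map.Entry<String, String>> merge(
            List<Map.Entry<String, String>> base,
            List<Map.Entry<String, String>> log) {
        List<Map.Entry<String, String>> out = new ArrayList<>(base.size() + log.size());
        int i = 0;
        int j = 0;
        while (i < base.size() && j < log.size()) {
            int cmp = base.get(i).getKey().compareTo(log.get(j).getKey());
            if (cmp < 0) {
                out.add(base.get(i++));   // key exists only in the base file
            } else if (cmp > 0) {
                out.add(log.get(j++));    // key exists only in the log file
            } else {
                out.add(log.get(j++));    // same key: keep the log record
                i++;                      // and skip the older base record
            }
        }
        while (i < base.size()) out.add(base.get(i++));
        while (j < log.size()) out.add(log.get(j++));
        return out;
    }
}
```

Because both inputs are consumed front to back, the merge needs to hold only the current record from each side, and its output is again sorted by key.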
   
   ### Impact
   
   Sorting the base file and log file data in Hudi offers the following 
benefits:
   
   1. **Enhancing Hudi Reader Performance**: Sorted data can be read with a 
merge sort, which avoids the large memory usage or disk IO overhead of 
Map-based methods.
   2. **Improving Compaction Performance**: Sorted data can also be merged with 
a merge sort during compaction, which reduces the large memory usage and disk 
IO overhead required by Map-based methods.
   3. **Supporting a Primary Key Index in the MDT**: Sorted data makes it 
possible to introduce a sparse index similar to the one in ClickHouse, which 
improves the query efficiency of the Hudi Reader.
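
As a rough illustration of the third point, the sketch below shows how a sparse index can work over key-sorted data. It is an assumption-laden example, not the index proposed for the MDT: `SparseIndexSketch`, `blockSize`, and `lookup` are hypothetical names, keys are assumed unique, and the data is a plain in-memory list. The index stores only the first key of each fixed-size block, so a lookup binary-searches the small index and then scans a single block.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; this is not Hudi's or ClickHouse's implementation.
class SparseIndexSketch {
    private final List<String> sortedKeys;                     // data, sorted by key
    private final int blockSize;                               // keys per block
    private final List<String> firstKeys = new ArrayList<>();  // one entry per block

    SparseIndexSketch(List<String> sortedKeys, int blockSize) {
        this.sortedKeys = sortedKeys;
        this.blockSize = blockSize;
        for (int i = 0; i < sortedKeys.size(); i += blockSize) {
            firstKeys.add(sortedKeys.get(i));
        }
    }

    // Returns the position of key in the sorted data, or -1 if it is absent.
    int lookup(String key) {
        int pos = Collections.binarySearch(firstKeys, key);
        // When not found, binarySearch returns -(insertionPoint) - 1, so the
        // last block whose first key is <= key is insertionPoint - 1.
        int block = pos >= 0 ? pos : -pos - 2;
        if (block < 0) {
            return -1; // key sorts before every stored key
        }
        int start = block * blockSize;
        int end = Math.min(start + blockSize, sortedKeys.size());
        for (int i = start; i < end; i++) {
            if (sortedKeys.get(i).equals(key)) {
                return i;
            }
        }
        return -1;
    }
}
```

This only works because the data is sorted: the first key of a block bounds every key in it, so all other blocks can be skipped without being read.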
   
   Performance Comparison After Introducing **Ordered Hudi Data** (Including 
Compaction with Merge Sort and Log Compaction with Merge Sort), **Primary Key 
Index** and **Secondary Index**: [Performance Comparison 
link](https://docs.google.com/presentation/d/10hHkQsd0bCdCor4w9Wduds5l-i_Z7X7A-zk_vYYi4kA/edit#slide=id.p19)
   
   ### Risk level (write none, low medium or high below)
   
   medium
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   


