TheR1sing3un opened a new pull request, #13127:
URL: https://github.com/apache/hudi/pull/13127

   Currently, a `snapshot` read of file slices that contain only a base file 
performs worse than a `read_optimized` read of the same data.
   The main reason is that a `read_optimized` read can enable vectorized 
parquet reading, while a `snapshot` read never does. See the related Spark 
change: https://github.com/apache/spark/pull/38397 . This behavior seems a 
little too strict, because whether parquet is read with the vectorized reader 
and whether the result is returned as a batch are two separate decisions.
   So I modified the code: even for a `snapshot` read, a file slice that 
contains only a base file is now read with the vectorized parquet reader. 
However, for a `snapshot` read the batch result is always set to false, because 
we can't be sure whether some file slice needs to be merged at read time. That 
merge is row-based, so a batch result cannot be returned.
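
   The decision described above can be sketched as a small model. This is not 
the code from this PR: `FileSlice`, `ReadPlan`, `plan_read`, and the 
`vectorization_supported` flag are hypothetical names used only to illustrate 
how "decode with the vectorized reader" and "return a batch" are decided 
separately.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileSlice:
    """Hypothetical stand-in for a file slice: one base file plus optional log files."""
    base_file: str
    log_files: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class ReadPlan:
    vectorized_decode: bool  # read the parquet base file with the vectorized reader
    return_batch: bool       # hand columnar batches (instead of rows) to the caller


def plan_read(file_slice: FileSlice, query_type: str,
              vectorization_supported: bool = True) -> ReadPlan:
    """Illustrative model of the two independent decisions.

    `vectorization_supported` is an assumed flag standing in for whatever
    schema and config checks the engine performs before vectorizing.
    """
    base_file_only = not file_slice.log_files
    if query_type == "read_optimized":
        # Only base files are read, so decoding and output may both be columnar.
        return ReadPlan(vectorization_supported, vectorization_supported)
    if query_type == "snapshot":
        # Base-file-only slices are decoded with the vectorized reader, but the
        # scan always emits rows, since another slice may need a row-based merge.
        return ReadPlan(vectorization_supported and base_file_only, False)
    raise ValueError(f"unsupported query type: {query_type}")
```

   In this model a base-file-only slice under `snapshot` yields 
`ReadPlan(vectorized_decode=True, return_batch=False)`, while a slice with log 
files keeps the row-based path for both.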
   
   > Our test case
   
   1. All file slices are base-file-only
   2. 3 GB per partition
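
   A comparison of this kind could be driven roughly as follows. This is a 
sketch under assumptions, not the script used for the numbers in this PR: 
`spark` is assumed to be an existing SparkSession with the Hudi bundle on its 
classpath, `table_path` is a hypothetical table location, and the helper names 
are made up. The option key `hoodie.datasource.query.type` and Spark's `noop` 
sink are existing features.

```python
import time

# Hudi datasource option that selects the query type for a read.
QUERY_TYPE_KEY = "hoodie.datasource.query.type"


def hudi_read_options(query_type):
    """Build reader options for the two query types compared in this PR."""
    if query_type not in ("snapshot", "read_optimized"):
        raise ValueError(f"unsupported query type: {query_type}")
    return {QUERY_TYPE_KEY: query_type}


def time_full_scan(spark, table_path, query_type):
    """Return wall-clock seconds for a full scan of the table.

    `spark` and `table_path` are supplied by the caller (assumptions, see above).
    """
    df = (spark.read.format("hudi")
          .options(**hudi_read_options(query_type))
          .load(table_path))
    start = time.perf_counter()
    # The `noop` sink (Spark 3.x) materializes every row and column without writing output.
    df.write.format("noop").mode("overwrite").save()
    return time.perf_counter() - start
```

   Calling `time_full_scan` once with `"read_optimized"` and once with 
`"snapshot"` on the same table gives the two numbers to compare.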
   
   > Read with operation: read_optimized
   
   <img width="256" alt="image" 
src="https://github.com/user-attachments/assets/776a0c88-6d63-48ee-a4e3-056970c12368"
 />
   
   
   > Before optimization: snapshot_read
   
   <img width="277" alt="image" 
src="https://github.com/user-attachments/assets/a09d84a9-91fc-4e80-a970-a7ad600134d9"
 />
   
   > After optimization: snapshot_read
   
   <img width="248" alt="image" 
src="https://github.com/user-attachments/assets/529b4410-d643-4f84-9a1b-d18cee23168e"
 />
   
   
   
   ### Change Logs
   
   1. Enable vectorized reading for file slices without log files
   
   
   ### Impact
   
   Improves snapshot read performance when some file slices are 
base-file-only.
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed
   


