TheR1sing3un opened a new pull request, #13127: URL: https://github.com/apache/hudi/pull/13127
Currently, a `snapshot` read of a file slice that has only a base file shows a significant performance regression compared with a `read_optimized` read of the same data. The main reason is that a `read_optimized` read can turn on vectorized Parquet reading, whereas a `snapshot` read never does.

Referring to Spark's code (https://github.com/apache/spark/pull/38397), this behavior seems a little too strict, because whether Parquet is read with the vectorized reader and whether a batch is returned are two decisions that can be made separately.

So I modified the code: even for a `snapshot` read, if the file slice to read contains only the base file, the Parquet file is read with vectorization. However, for a `snapshot` read the batch result is always set to false. We cannot be sure whether some file slice will need to be merged at read time, and that merge is row-based, so a batch result cannot be returned.

> Our test case

1. all file slices are base file only
2. 3 GB per partition

> Read with operation: read_optimized

<img width="256" alt="image" src="https://github.com/user-attachments/assets/776a0c88-6d63-48ee-a4e3-056970c12368" />

> Before optimization: snapshot_read

<img width="277" alt="image" src="https://github.com/user-attachments/assets/a09d84a9-91fc-4e80-a970-a7ad600134d9" />

> After optimization: snapshot_read

<img width="248" alt="image" src="https://github.com/user-attachments/assets/529b4410-d643-4f84-9a1b-d18cee23168e" />

### Change Logs

1. Enable vectorized reading for file slices without log files.

### Impact

Improves snapshot read performance when some file slices are base-file-only.

### Risk level (write none, low medium or high below)

low

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
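For reviewers, the decision described in this PR can be sketched roughly as follows. This is only an illustration and not the code in the patch: `SnapshotReadPlan`, `FileSlice`, `ReadDecision`, and `decide` are hypothetical names, `vectorizedSupported` stands in for whatever Spark checks (configuration and schema) before allowing the vectorized Parquet reader, and the assumption that a `read_optimized` read returns batches whenever it decodes with the vectorized reader is mine.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SnapshotReadPlan {
    enum QueryType { SNAPSHOT, READ_OPTIMIZED }

    /** Hypothetical stand-in for a file slice: one base file plus zero or more log files. */
    static final class FileSlice {
        final String baseFile;
        final List<String> logFiles;

        FileSlice(String baseFile, List<String> logFiles) {
            this.baseFile = baseFile;
            this.logFiles = logFiles;
        }

        boolean isBaseFileOnly() {
            return logFiles.isEmpty();
        }
    }

    /** The two decisions that this PR separates. */
    static final class ReadDecision {
        final boolean vectorizedDecode; // decode Parquet with the vectorized reader
        final boolean returnBatch;      // hand columnar batches (not rows) to the caller

        ReadDecision(boolean vectorizedDecode, boolean returnBatch) {
            this.vectorizedDecode = vectorizedDecode;
            this.returnBatch = returnBatch;
        }
    }

    static ReadDecision decide(QueryType queryType, FileSlice slice, boolean vectorizedSupported) {
        if (queryType == QueryType.READ_OPTIMIZED) {
            // Only base files are read, so both decisions follow the engine's support check
            // (assumption: batches are returned whenever vectorized decoding is used).
            return new ReadDecision(vectorizedSupported, vectorizedSupported);
        }
        // Snapshot read: vectorize only when there is nothing to merge for this slice.
        boolean vectorizedDecode = vectorizedSupported && slice.isBaseFileOnly();
        // Other slices in the same scan may need a row-based merge, so never return batches.
        return new ReadDecision(vectorizedDecode, false);
    }

    public static void main(String[] args) {
        FileSlice baseOnly = new FileSlice("base.parquet", Collections.<String>emptyList());
        FileSlice withLogs = new FileSlice("base.parquet", Arrays.asList(".log.1"));
        for (FileSlice slice : Arrays.asList(baseOnly, withLogs)) {
            ReadDecision d = decide(QueryType.SNAPSHOT, slice, true);
            System.out.println(slice.logFiles.size() + " log file(s): vectorizedDecode="
                    + d.vectorizedDecode + ", returnBatch=" + d.returnBatch);
        }
    }
}
```

Keeping the two flags separate is the point of the change: a base-file-only slice gets the faster decoder, while the output stays row-based so that slices with log files can still go through the merge path in the same scan.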
