majian1998 opened a new pull request, #10485:
URL: https://github.com/apache/hudi/pull/10485

   In the current implementation of data skipping, column statistics for the 
entire table are read, and the data skipping filters are then evaluated against 
all of them. When the table is large and has many partitions, this makes data 
skipping less efficient, because the partition pruning conditions are not used 
to narrow the stats first.
   
   By applying the partition filter conditions right after the column 
statistics are read, the set of column stats that data skipping subsequently 
evaluates becomes significantly smaller. This saves time in the later 
computation and also conserves memory.
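   
   The idea can be illustrated with a minimal sketch in Python. This is not 
Hudi's actual code: `ColumnStat`, `prune_by_partition`, and `candidate_files` 
are hypothetical names, and the stats rows are made-up sample data. It only 
shows that filtering the stats by partition first shrinks the input to the 
min/max overlap check.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ColumnStat:
    """One column-stats row: min/max of a single column within one file."""
    partition: str  # relative partition path (hypothetical layout)
    file_name: str
    column: str
    min_value: int
    max_value: int


def prune_by_partition(stats: Iterable[ColumnStat],
                       partitions: Set[str]) -> List[ColumnStat]:
    """Keep only the stats rows that belong to the queried partitions."""
    return [s for s in stats if s.partition in partitions]


def candidate_files(stats: Iterable[ColumnStat],
                    column: str, lo: int, hi: int) -> Set[str]:
    """Data skipping: keep files whose [min, max] range overlaps [lo, hi]."""
    return {
        s.file_name
        for s in stats
        if s.column == column and s.max_value >= lo and s.min_value <= hi
    }


# Made-up stats for a table with three partitions.
stats = [
    ColumnStat("dt=2024-01-01", "f1.parquet", "price", 0, 50),
    ColumnStat("dt=2024-01-01", "f2.parquet", "price", 60, 100),
    ColumnStat("dt=2024-01-02", "f3.parquet", "price", 10, 40),
    ColumnStat("dt=2024-01-02", "f4.parquet", "price", 70, 90),
    ColumnStat("dt=2024-01-03", "f5.parquet", "price", 20, 30),
]

# Query: dt = '2024-01-02' AND price BETWEEN 20 AND 35
pruned = prune_by_partition(stats, {"dt=2024-01-02"})

unpruned_candidates = candidate_files(stats, "price", 20, 35)
pruned_candidates = candidate_files(pruned, "price", 20, 35)

print(len(stats), len(pruned))
print(sorted(unpruned_candidates), sorted(pruned_candidates))
```

   In this sketch the overlap check runs over two stats rows instead of five, 
and the candidate list no longer contains files from partitions the query never 
touches.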
   
   In a test on a 22TB table spread across 60 subpartitions, a query was run 
against a single subpartition of 1.4TB. In this simple test, data skipping 
with partition pruning applied finished several seconds faster, which confirms 
that queries involving partition pruning benefit from the change. The memory 
footprint of the candidate file list needed for further computation is also 
substantially reduced.
   
   When a query involves no partition pruning, this change adds only a minimal 
cost. That cost is inconsequential in both cases: if the data volume is large, 
an overhead on the order of seconds is negligible; if the data volume is 
small, the table does not need partitioning in the first place, and the filter 
operation is not time-consuming.
   
   ### Change Logs
   
   Push down partition pruning conditions to column stats during data 
skipping.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

