[PR] [core][spark] Use DV-aware tight bounds for Spark MIN/MAX pushdown [paimon]

via GitHub Sun, 31 May 2026 02:09:21 -0700


kerwin-zk opened a new pull request, #8047:
URL: https://github.com/apache/paimon/pull/8047


   ### Purpose
   Spark currently disables MIN/MAX aggregate pushdown for any table with 
`deletion-vectors.enabled=true`. This is correct but too conservative: many 
DV-enabled non-primary-key tables, or many splits inside them, do not actually 
have deleted rows. In those cases the recorded file min/max stats are still 
tight (they describe exactly the rows that remain visible) and can safely 
answer MIN/MAX.
   
   This PR makes the decision based on runtime split metadata instead of the 
table-level DV option. It derives whether a data file still has tight stats 
from `DataFileMeta.deleteRowCount` and the paired `DeletionFile.cardinality`, 
then allows Spark MIN/MAX pushdown only when every file in the split is tight.
   
   This keeps the existing safety behavior for files with real deletes or 
unknown DV cardinality, while recovering MIN/MAX pushdown for DV-enabled 
tables/splits that have no effective deletions.
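
As a rough illustration of the per-split decision described above, the check might look like the following minimal sketch. `FileEntry`, `isTight`, and `canPushDownMinMax` are hypothetical stand-ins and not Paimon's actual classes or methods. The exact rule is an assumption: a file is treated as tight only when its delete row count is known to be zero and its paired deletion vector, if any, has a known cardinality of zero.

```java
import java.util.List;

/** Hypothetical sketch; these types are stand-ins, not Paimon's classes. */
public final class TightBoundsSketch {

    /** Stand-in for one data file plus its optional paired deletion vector. */
    static final class FileEntry {
        final Long deleteRowCount;     // null means the count was not recorded
        final boolean hasDeletionFile; // whether a deletion vector is paired with the file
        final Long dvCardinality;      // null means the cardinality is unknown

        FileEntry(Long deleteRowCount, boolean hasDeletionFile, Long dvCardinality) {
            this.deleteRowCount = deleteRowCount;
            this.hasDeletionFile = hasDeletionFile;
            this.dvCardinality = dvCardinality;
        }
    }

    /** Assumed rule: stats stay tight only if no row is known or suspected to be deleted. */
    static boolean isTight(FileEntry f) {
        // Unknown or non-zero delete row count: be conservative.
        if (f.deleteRowCount == null || f.deleteRowCount != 0L) {
            return false;
        }
        // No paired deletion vector: nothing has been removed from this file.
        if (!f.hasDeletionFile) {
            return true;
        }
        // A deletion vector exists: tight only if it is known to be empty.
        return f.dvCardinality != null && f.dvCardinality == 0L;
    }

    /** MIN/MAX pushdown is allowed only when every file in the split is tight. */
    static boolean canPushDownMinMax(List<FileEntry> split) {
        return split.stream().allMatch(TightBoundsSketch::isTight);
    }
}
```

Under these assumptions, one file with real deletes or an unknown DV cardinality is enough to disable pushdown for its split, which matches the conservative fallback the PR keeps.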
   
   ### Tests
   CI
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
