zhongyujiang opened a new pull request, #6893:
URL: https://github.com/apache/iceberg/pull/6893

   We found that Parquet row-group filters sometimes fail to prune row groups, 
specifically when evaluating an OR expression whose child expressions can each 
be decided only by a different row-group filter. 
   
   For example, suppose we have a sorted column `foo` whose null values are all 
clustered together after sorting, so a query like `foo IS NULL` can filter out 
most of the data. But when we combine it with another condition, for example 
`bar IN (x, y, z) OR foo IS NULL` (column `bar` is not sorted), the row-group 
filters no longer work well. We found this is because 
`ParquetMetricsRowGroupFilter` is ineffective at evaluating `bar IN (x, y, z)`, 
while `ParquetDictionaryRowGroupFilter` cannot answer `foo IS NULL` because 
Parquet dictionaries carry no null statistics. Each filter evaluates the whole 
expression on its own, so neither can rule out the OR. I expect the same thing 
happens when one child of the OR can only be answered by 
`ParquetBloomRowGroupFilter` and the other can only be answered by 
`ParquetMetricsRowGroupFilter` or `ParquetDictionaryRowGroupFilter`.
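
   To make the failure mode concrete, here is a minimal, self-contained sketch. 
It does not use Iceberg's classes; `OrFilterGap` and its methods are hypothetical 
stand-ins, and it assumes the min/max range of the unsorted `bar` column is too 
wide for the stats check to rule out the IN list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OrFilterGap {
  // Stats-style check (hypothetical): knows the null count of `foo`, but the
  // min/max range of unsorted `bar` is assumed too wide to rule out the IN list.
  static boolean metricsShouldRead(long fooNullCount) {
    boolean barInMightMatch = true;
    boolean fooIsNullMightMatch = fooNullCount > 0;
    return barInMightMatch || fooIsNullMightMatch;
  }

  // Dictionary-style check (hypothetical): knows every non-null value of `bar`,
  // but has no null information, so `foo IS NULL` always "might match".
  static boolean dictionaryShouldRead(Set<String> barDictionary, Set<String> wanted) {
    boolean barInMightMatch = false;
    for (String value : wanted) {
      barInMightMatch |= barDictionary.contains(value);
    }
    boolean fooIsNullMightMatch = true;
    return barInMightMatch || fooIsNullMightMatch;
  }

  // The row group is read only if every filter says it might match.
  static boolean shouldRead(long fooNullCount, Set<String> barDictionary, Set<String> wanted) {
    return metricsShouldRead(fooNullCount) && dictionaryShouldRead(barDictionary, wanted);
  }

  public static void main(String[] args) {
    Set<String> wanted = new HashSet<>(Arrays.asList("x", "y", "z"));
    Set<String> barDictionary = new HashSet<>(Arrays.asList("a", "b"));
    // No nulls in foo and no wanted value in bar: no row can match, yet this prints true.
    System.out.println(shouldRead(0L, barDictionary, wanted));
  }
}
```

   Each check can disprove one side of the OR, but because each one evaluates 
the full expression independently, the undecided side keeps the result at 
"might match" in both.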

   
   This PR tries to solve this kind of issue. It borrows the idea of 
`ResidualEvaluator`: during evaluation, each row-group filter eliminates the 
predicates for which it can reach a ROWS_CANNOT_MATCH or ROWS_ALL_MATCH 
conclusion, reducing the expression to a residual that is then passed to the 
next row-group filter for evaluation. In this way, the three row-group filters 
work together to evaluate an expression. 
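
   The residual idea can be sketched in a few lines. This is an illustration 
only, not the code in this PR: `ResidualChainSketch`, `Expr`, `Pred`, and the 
two lambda "filters" are hypothetical, only OR is modeled, and each filter is 
reduced to a function that maps a predicate to TRUE, FALSE, or the predicate 
itself when it cannot decide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ResidualChainSketch {
  // Minimal expression tree; hypothetical, not Iceberg's Expression API.
  interface Expr {}

  enum Const implements Expr { TRUE, FALSE }

  static final class Pred implements Expr {
    final String name;
    Pred(String name) { this.name = name; }
  }

  static final class Or implements Expr {
    final Expr left;
    final Expr right;
    Or(Expr left, Expr right) { this.left = left; this.right = right; }
  }

  // Builds an OR and folds away constants.
  static Expr or(Expr left, Expr right) {
    if (left == Const.TRUE || right == Const.TRUE) return Const.TRUE;
    if (left == Const.FALSE) return right;
    if (right == Const.FALSE) return left;
    return new Or(left, right);
  }

  // Rewrites the expression with one filter: decided predicates become constants,
  // undecided predicates are kept for the next filter.
  static Expr residual(Expr expr, Function<Pred, Expr> filter) {
    if (expr instanceof Or) {
      Or o = (Or) expr;
      return or(residual(o.left, filter), residual(o.right, filter));
    }
    if (expr instanceof Pred) return filter.apply((Pred) expr);
    return expr; // already a constant
  }

  // Passes the residual from filter to filter; stops as soon as it is decided.
  static boolean shouldRead(Expr expr, List<Function<Pred, Expr>> filters) {
    Expr remaining = expr;
    for (Function<Pred, Expr> filter : filters) {
      remaining = residual(remaining, filter);
      if (remaining == Const.FALSE) return false;
      if (remaining == Const.TRUE) return true;
    }
    return true; // still undecided, so the row group must be read
  }

  static final Expr EXPR = or(new Pred("bar IN (x, y, z)"), new Pred("foo IS NULL"));

  // Stats-style filter: assume the row group has no nulls in `foo`.
  static final Function<Pred, Expr> METRICS =
      p -> p.name.equals("foo IS NULL") ? Const.FALSE : p;

  // Dictionary-style filter: assume the dictionary of `bar` lacks x, y and z.
  static final Function<Pred, Expr> DICTIONARY =
      p -> p.name.equals("bar IN (x, y, z)") ? Const.FALSE : p;

  public static void main(String[] args) {
    System.out.println(shouldRead(EXPR, Arrays.asList(METRICS)));             // true
    System.out.println(shouldRead(EXPR, Arrays.asList(METRICS, DICTIONARY))); // false
  }
}
```

   With only the stats-style filter the residual is `bar IN (x, y, z)` and the 
row group is still read; once that residual reaches the dictionary-style filter 
it folds to FALSE and the row group can be skipped.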
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

