Ted-Jiang commented on code in PR #8669:
URL: https://github.com/apache/arrow-datafusion/pull/8669#discussion_r1448792524
##########
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs:
##########
@@ -276,15 +278,74 @@ impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> {
scalar.to_array().ok()
}
+ /// The basic idea is to check whether all of the `values` are not within the min-max boundary.
+ /// If any one value is within the min-max boundary, then this row group will not be skipped.
+ /// Otherwise, this row group will be able to be skipped.
fn contained(
&self,
- _column: &Column,
- _values: &HashSet<ScalarValue>,
+ column: &Column,
+ values: &HashSet<ScalarValue>,
) -> Option<BooleanArray> {
- None
+ let min_values = self.min_values(column)?;
+ let max_values = self.max_values(column)?;
+ // The boundary should be with length of 1
+ if min_values.len() != max_values.len() || min_values.len() != 1 {
+ return None;
+ }
+ let min_value = ScalarValue::try_from_array(min_values.as_ref(), 0).ok()?;
+ let max_value = ScalarValue::try_from_array(max_values.as_ref(), 0).ok()?;
+
+ // The boundary should be with the same data type
+ if min_value.data_type() != max_value.data_type() {
+ return None;
+ }
+ let target_data_type = min_value.data_type();
+
+ let (c, _) = self.column(&column.name)?;
+ let has_null = c.statistics()?.null_count() > 0;
+ let mut known_not_present = true;
+ for value in values {
+ // If it's null, check whether the null exists from the statistics
Review Comment:
I think `col in (NULL)` will not match anything, the same as `col = null`,
which means `col in (a,b,c)` is the same as `col in (a,b,c, null)`. Is there
any rule that removes the null from the in list? 🤔 @alamb
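
To make the point concrete, here is a minimal standalone sketch of SQL three-valued logic for `IN`. It is not DataFusion code: `Tri`, `sql_in` and `passes_filter` are hypothetical names used only for illustration, `Option<i32>` stands in for a nullable column value, and the list is assumed to be non-empty. It suggests that, when the predicate is used as a filter, a null in the list never changes which rows are kept.

```rust
/// SQL three-valued logic: Some(true), Some(false), or None for UNKNOWN (NULL).
/// (Hypothetical alias, for illustration only.)
type Tri = Option<bool>;

/// Evaluates `value IN (list)` with SQL NULL semantics.
/// Assumes a non-empty list; `None` models SQL NULL.
fn sql_in(value: Option<i32>, list: &[Option<i32>]) -> Tri {
    // A NULL on the left-hand side can never compare equal to anything.
    let v = match value {
        Some(v) => v,
        None => return None,
    };
    let mut saw_null = false;
    for item in list {
        match item {
            // One definite match makes the whole predicate TRUE.
            Some(x) if *x == v => return Some(true),
            Some(_) => {}
            // Comparing against NULL yields UNKNOWN, never TRUE.
            None => saw_null = true,
        }
    }
    // No match: UNKNOWN if a NULL was in the list, FALSE otherwise.
    if saw_null { None } else { Some(false) }
}

/// A WHERE clause keeps a row only when the predicate is TRUE.
fn passes_filter(p: Tri) -> bool {
    p == Some(true)
}

fn main() {
    let rows = [Some(1), Some(2), Some(5), None];
    let without_null = [Some(1), Some(2), Some(3)];
    let with_null = [Some(1), Some(2), Some(3), None];
    for row in rows {
        let a = passes_filter(sql_in(row, &without_null));
        let b = passes_filter(sql_in(row, &with_null));
        println!("{:?}: kept without null = {}, kept with null = {}", row, a, b);
    }
}
```

One caveat: the equivalence only holds where UNKNOWN and FALSE are treated alike, as in a plain filter. Under `NOT IN` the two lists differ, because `NOT UNKNOWN` stays UNKNOWN while `NOT FALSE` is TRUE, so any rule that strips nulls from the list would need to account for negation.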