Re: [PR] Implement the contained method of RowGroupPruningStatistics [arrow-datafusion]

via GitHub Thu, 11 Jan 2024 04:35:51 -0800


Ted-Jiang commented on code in PR #8669:
URL: https://github.com/apache/arrow-datafusion/pull/8669#discussion_r1448792524



##########
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs:
##########
@@ -276,15 +278,74 @@ impl<'a> PruningStatistics for RowGroupPruningStatistics<'a> {
         scalar.to_array().ok()
     }
 
+    /// The basic idea is to check whether all of the `values` are not within the min-max boundary.
+    /// If any one value is within the min-max boundary, then this row group will not be skipped.
+    /// Otherwise, this row group will be able to be skipped.
     fn contained(
         &self,
-        _column: &Column,
-        _values: &HashSet<ScalarValue>,
+        column: &Column,
+        values: &HashSet<ScalarValue>,
     ) -> Option<BooleanArray> {
-        None
+        let min_values = self.min_values(column)?;
+        let max_values = self.max_values(column)?;
+        // The boundary should be with length of 1
+        if min_values.len() != max_values.len() || min_values.len() != 1 {
+            return None;
+        }
+        let min_value = ScalarValue::try_from_array(min_values.as_ref(), 0).ok()?;
+        let max_value = ScalarValue::try_from_array(max_values.as_ref(), 0).ok()?;
+
+        // The boundary should be with the same data type
+        if min_value.data_type() != max_value.data_type() {
+            return None;
+        }
+        let target_data_type = min_value.data_type();
+
+        let (c, _) = self.column(&column.name)?;
+        let has_null = c.statistics()?.null_count() > 0;
+        let mut known_not_present = true;
+        for value in values {
+            // If it's null, check whether the null exists from the statistics

Review Comment:
   I think `col in (NULL)` will not match anything, same as `col = null`, 
which means `col in (a,b,c)` is the same as `col in (a,b,c, null)`. Is there 
any rule to remove the null from the IN list? 🤔 @alamb  
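
To make the NULL point concrete, here is a minimal sketch of SQL three-valued `IN` semantics. It is standalone Rust, not DataFusion code: `sql_in`, `passes_filter`, and `strip_nulls` are hypothetical names chosen for illustration, `Option<i32>` stands in for a nullable column value, and `None` stands in for SQL NULL. The assumption is that the predicate is used as a filter, where only TRUE keeps a row.

```rust
/// SQL three-valued truth value: Some(true), Some(false), or None for NULL.
type Tri = Option<bool>;

/// Evaluate `value IN (list)` with SQL NULL semantics (None models NULL).
/// Hypothetical helper for illustration only.
fn sql_in(value: Option<i32>, list: &[Option<i32>]) -> Tri {
    // A NULL probe value compares as NULL against every list item.
    let v = value?;
    let mut saw_null = false;
    for item in list {
        match item {
            // Any definite match makes the whole predicate TRUE.
            Some(x) if *x == v => return Some(true),
            // A definite non-match contributes FALSE.
            Some(_) => {}
            // Comparing against NULL contributes NULL (unknown).
            None => saw_null = true,
        }
    }
    // No match: NULL if any comparison was unknown, otherwise FALSE.
    if saw_null { None } else { Some(false) }
}

/// A filter keeps a row only when the predicate is TRUE;
/// FALSE and NULL both drop the row.
fn passes_filter(p: Tri) -> bool {
    p == Some(true)
}

/// Drop NULL entries from an IN list (hypothetical rewrite step).
fn strip_nulls(list: &[Option<i32>]) -> Vec<Option<i32>> {
    list.iter().copied().filter(|item| item.is_some()).collect()
}
```

In this model, `sql_in(Some(4), &[Some(1), Some(2), None])` evaluates to NULL while the NULL-free list evaluates to FALSE, so the two predicates are not identical as expressions, but `passes_filter` rejects the row in both cases. Whether that equivalence is safe for pruning depends on the predicate staying in a plain filter position; under a negation such as `NOT IN` the NULL case behaves differently.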



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
