alamb commented on code in PR #12032:
URL: https://github.com/apache/datafusion/pull/12032#discussion_r1721659593


##########
datafusion/core/src/datasource/physical_plan/parquet/row_group_filter.rs:
##########
@@ -837,38 +867,70 @@ mod tests {
                 Some(100),
                 Some(600),
                 None,
-                0,
+                Some(0),
                 false,
             )],
         );
         let rgm2 = get_row_group_meta_data(
             &schema_descr,
             // [10, 20]
             // c1 > 5, this row group will be included in the results.
-            vec![ParquetStatistics::int32(Some(10), Some(20), None, 0, false)],
+            vec![ParquetStatistics::int32(
+                Some(10),
+                Some(20),
+                None,
+                Some(0),
+                false,
+            )],
         );
         let rgm3 = get_row_group_meta_data(
             &schema_descr,
             // [0, 2]
             // c1 > 5, this row group will not be included in the results.
-            vec![ParquetStatistics::int32(Some(0), Some(2), None, 0, false)],
+            vec![ParquetStatistics::int32(
+                Some(0),
+                Some(2),
+                None,
+                Some(0),
+                false,
+            )],
         );
         let rgm4 = get_row_group_meta_data(
             &schema_descr,
             // [None, 2]
-            // c1 > 5, this row group can not be filtered out, so will be included in the results.
-            vec![ParquetStatistics::int32(None, Some(2), None, 0, false)],
+            // c1 > 5, this row group will also not be included in the results

Review Comment:
   Note this is different, and an improvement.
   
   Previously, if *either* min or max was unknown, `has_statistics_set()` 
would return false and thus neither min nor max was reported (basically, 
`Statistics` could not distinguish between having only min or only max set):
   
   
https://github.com/apache/arrow-rs/blob/27789d7c9abb50796a4042e7e193703efe3c95b3/parquet/src/file/statistics.rs#L635-L637
   
   After https://github.com/apache/arrow-rs/pull/6216, `Statistics` can 
distinguish between having only one field set and having both, so row group 
index 3 (`rgm4`) can be pruned: its max is known to be 2, so no row in it can 
satisfy `c1 > 5`.
   
   I also added a 5th row group with min set but max unknown to show that 
this case is handled correctly too.
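   
   To make the reasoning concrete, here is a minimal sketch of the two behaviors for a `c1 > threshold` predicate. It is illustrative only: `ColumnBounds`, `can_prune_gt_all_or_nothing`, and `can_prune_gt` are hypothetical names, not the `parquet` crate's `Statistics` API or DataFusion's actual pruning code, and the min value used for the 5th row group in the usage note is an assumption.

```rust
/// Hypothetical stand-in for one column's row group statistics.
/// This is NOT the `parquet` crate's `Statistics` type.
#[derive(Debug, Clone, Copy)]
struct ColumnBounds {
    min: Option<i32>,
    max: Option<i32>,
}

/// Old behavior as described above: the bounds are only usable when
/// both min and max are known, so a lone max is ignored.
fn can_prune_gt_all_or_nothing(b: ColumnBounds, threshold: i32) -> bool {
    match (b.min, b.max) {
        // Every value is <= max, so max <= threshold rules out `c1 > threshold`.
        (Some(_), Some(max)) => max <= threshold,
        // Missing either bound: treat statistics as absent and keep the row group.
        _ => false,
    }
}

/// New behavior: each bound is considered on its own. For `c1 > threshold`
/// only the max matters, so an unknown min does not block pruning.
fn can_prune_gt(b: ColumnBounds, threshold: i32) -> bool {
    match b.max {
        Some(max) => max <= threshold,
        // Unknown max: values could be arbitrarily large, so keep the row group.
        None => false,
    }
}
```

   With `threshold = 5`, bounds of `[None, 2]` are kept by `can_prune_gt_all_or_nothing` but pruned by `can_prune_gt`, while bounds with a known min and unknown max (say `[10, None]`) are kept by both.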



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

