alamb commented on a change in pull request #719:
URL: https://github.com/apache/arrow-datafusion/pull/719#discussion_r669933745



##########
File path: datafusion/src/physical_plan/parquet.rs
##########
@@ -312,22 +431,47 @@ impl ParquetExec {
             if let Some(x) = &part.statistics.column_statistics {
                 let part_nulls: Vec<Option<usize>> =
                     x.iter().map(|c| c.null_count).collect();
-                has_null_counts = true;
+                has_statistics = true;
+
+                let part_max_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.max_value.clone()).collect();
+                let part_min_values: Vec<Option<ScalarValue>> =
+                    x.iter().map(|c| c.min_value.clone()).collect();
 
                 for &i in projection.iter() {
                     null_counts[i] = part_nulls[i].unwrap_or(0);
+                    if let Some(part_max_value) = part_max_values[i].clone() {
+                        if let Some(max_value) = &mut max_values[i] {
+                            let _ = max_value.update(&[part_max_value]);

Review comment:
       This doesn't seem right to me -- it seems like it ignores errors if the 
min or max values can't be updated (maybe due to overflow or data type 
mismatch). I think it might be safer if there is an error calling `update()` to 
reset the `min_values[i]` or `max_values[i]` back to None
   
   Otherwise if you have something like
   ```
   Row Group 1: min=1
   Row Group 2: min=MIN_INT <-- and this causes errors for some reason
   Row Group 3: min=3
   ```
   The code as written will return `min =1` for this column in the parquet file 
which may be incorrect. I think a more conservative approach here would be for 
this code to return `None` so we don't erroneously use these statistics
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to