alamb commented on code in PR #9129:
URL: https://github.com/apache/arrow-datafusion/pull/9129#discussion_r1485323984


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -369,6 +454,29 @@ fn summarize_min_max(
                     .unwrap_or_else(|_| min_values[i] = None);
             }
         }
+
+        ParquetStatistics::ByteArray(s)
+            if matches!(fields[i].data_type(), DataType::Utf8 | DataType::LargeUtf8) =>
+        {
+            if let Some(max_value) = &mut max_values[i] {

Review Comment:
   I believe byte arrays are also used to store `DataType::Decimal` values
(though hopefully, if we consolidate the statistics conversion code, it will
"just work").
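
To illustrate why decimals need different handling from strings: the Parquet format stores a DECIMAL backed by a byte array as the big-endian two's-complement bytes of the unscaled value, so those bytes are not valid input for UTF-8 decoding. Below is a minimal sketch of that decoding using only the standard library. The function name `decimal_from_be_bytes` is hypothetical and is not part of DataFusion or the parquet crate.

```rust
/// Hypothetical helper (not a DataFusion API): decode the big-endian
/// two's-complement bytes of a Parquet DECIMAL into its unscaled i128 value.
/// Returns None for empty input or for values wider than 16 bytes.
fn decimal_from_be_bytes(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: pad with 0xFF when the top bit is set, otherwise with 0x00.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    // Right-align the input so the least significant byte stays last.
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

For instance, `decimal_from_be_bytes(&[0x01, 0x00])` yields `Some(256)`; the scale from the column's decimal type would still need to be applied separately.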



##########
datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs:
##########
@@ -1003,6 +1006,246 @@ mod tests {
         );
     }
 
+    #[test]
+    fn row_group_pruning_predicate_utf8() {

Review Comment:
   I believe the tests in this module are for row group pruning, which uses the
statistics extraction code in
   
https://github.com/apache/arrow-datafusion/blob/6c4109017edfe10800e0ffee8c1c254aade05849/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L58-L57,
 which confusingly isn't the same code used to extract statistics for the
entire file.
   
   A way to test this might be to create a parquet exec to read
`alltypes_plain.parquet` and verify that statistics are present.
   
   For example, I think this information is encoded in the
`physical_plan_with_stats` line like this:
   
   ```
   [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]]
   ```
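
A test could make the "statistics are present" check mechanical by scanning that string for empty column entries. The sketch below is only an illustration on the rendered text: `columns_without_stats` is a hypothetical helper, and the assumption that a populated entry carries text after the colon (something other than `(Col[N]:)`) is mine, not confirmed against DataFusion's display code.

```rust
/// Hypothetical helper: return the indices of columns whose rendered entry
/// is empty, i.e. appears exactly as `(Col[N]:)` in the statistics string.
fn columns_without_stats(stats: &str) -> Vec<usize> {
    let mut missing = Vec::new();
    let mut rest = stats;
    while let Some(start) = rest.find("(Col[") {
        // Skip past the "(Col[" prefix (5 bytes) to reach the index digits.
        let after = &rest[start + 5..];
        let Some(close) = after.find(']') else { break };
        let Ok(idx) = after[..close].parse::<usize>() else { break };
        let tail = &after[close + 1..];
        // An entry with nothing between ':' and ')' carries no statistics.
        if tail.starts_with(":)") {
            missing.push(idx);
        }
        rest = tail;
    }
    missing
}
```

On the output shown here, this would report every column from 0 through 10 as missing; once string statistics are extracted, the Utf8 columns would be expected to drop out of that list.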
   
   
   ```
   ❯ explain verbose select * from './parquet-testing/data/alltypes_plain.parquet';
   ....
   | physical_plan_with_stats | ParquetExec: file_groups={1 group: [[Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, bool_col, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col], statistics=[Rows=Exact(8), Bytes=Absent, [(Col[0]:),(Col[1]:),(Col[2]:),(Col[3]:),(Col[4]:),(Col[5]:),(Col[6]:),(Col[7]:),(Col[8]:),(Col[9]:),(Col[10]:)]] |
   ```


