pjmore commented on issue #799:
URL: https://github.com/apache/arrow-rs/issues/799#issuecomment-1019666366


   @tfiasco I think this is because the C++ implementation only uses min_value 
and max_value if a column order is set: 
https://github.com/apache/arrow/blob/54460d96ba1d613e472d8d9a96c072147e736b4d/cpp/src/parquet/metadata.cc#L82
   For your example, with the current implementation (I used version 7), a 
modified version of your snippet prints `None` for `column_orders`:
   ```
   println!(
       "statistics: {:?}",
       &arrow_reader
           .get_metadata()
           .row_group(0)
           .to_thrift()
           .columns[0]
           .meta_data
           .as_ref()
   );
   println!(
       "statistics: {:?}",
       &arrow_reader
           .get_metadata()
           .row_group(0)
           .to_thrift()
           .columns[1]
           .meta_data
           .as_ref()
   );
   println!(
       "column_orders: {:?}",
       &arrow_reader
           .get_metadata()
           .file_metadata()
           .column_orders()
   );
   ```
   This prints 
   ```
   statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 47, index_page_offset: None, dictionary_page_offset: Some(4), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([5, 0, 0, 0]), min_value: Some([1, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
   statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id2"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 181, index_page_offset: None, dictionary_page_offset: Some(138), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([6, 0, 0, 0]), min_value: Some([2, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
   column_orders: None
   ```
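   As an aid to reading that output: for an Int32 column the min_value and 
max_value bytes are the PLAIN encoding of the value, i.e. 4 little-endian 
bytes. A minimal sketch of decoding them, where `decode_i32_stat` is a 
hypothetical helper name and not something from the parquet crate:

```rust
/// Decode a PLAIN-encoded Int32 statistics value (4 little-endian bytes).
/// Hypothetical helper for illustration; not part of the parquet crate.
fn decode_i32_stat(bytes: &[u8]) -> Option<i32> {
    // Anything other than exactly 4 bytes is not a valid Int32 value.
    if bytes.len() != 4 {
        return None;
    }
    Some(i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

   With the bytes printed above, `decode_i32_stat(&[1, 0, 0, 0])` gives 
`Some(1)` and `decode_i32_stat(&[5, 0, 0, 0])` gives `Some(5)`, so the `id` 
column does carry min and max statistics.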
   So it looks like the C++ implementation is simply ignoring the stats values 
that are written. While digging through the code to figure this out, I found a 
comment in parquet_format saying that without column_orders the meaning of 
min_value and max_value is undefined. If that comment is accurate, the way the 
current implementation uses min_value and max_value looks like a bug. The 
comment in question is:
   ```
     /// Sort order used for the min_value and max_value fields of each column in
     /// this file. Sort orders are listed in the order matching the columns in the
     /// schema. The indexes are not necessary the same though, because only leaf
     /// nodes of the schema are represented in the list of sort orders.
     ///
     /// Without column_orders, the meaning of the min_value and max_value fields is
     /// undefined. To ensure well-defined behaviour, if min_value and max_value are
     /// written to a Parquet file, column_orders must be written as well.
     ///
     /// The obsolete min and max fields are always sorted by signed comparison
     /// regardless of column_orders.
   ```
   
https://github.com/sunchao/parquet-format-rs/blob/b0d5bcb51a919837310c7dccd5141ea956346357/src/parquet_format.rs#L4919
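
   A rough sketch of the rule that comment describes, as a reader might apply 
it. `Stats` and `usable_min_max` are hypothetical, simplified stand-ins for 
the Thrift `Statistics` struct, not the parquet crate's or the C++ reader's 
actual API, and falling back to the obsolete min/max fields is my assumption 
about reasonable reader behaviour:

```rust
/// Simplified, hypothetical stand-in for the Thrift `Statistics` struct;
/// only the four fields relevant here are modelled.
#[derive(Debug, Default, Clone, PartialEq)]
struct Stats {
    min: Option<Vec<u8>>, // obsolete field
    max: Option<Vec<u8>>, // obsolete field
    min_value: Option<Vec<u8>>,
    max_value: Option<Vec<u8>>,
}

/// Returns the (min, max) byte pair a reader may rely on, if any.
fn usable_min_max(stats: &Stats, has_column_order: bool) -> Option<(&[u8], &[u8])> {
    // min_value/max_value are only well-defined when column_orders is written.
    if has_column_order {
        if let (Some(lo), Some(hi)) = (&stats.min_value, &stats.max_value) {
            return Some((lo.as_slice(), hi.as_slice()));
        }
    }
    // Assumed fallback: the obsolete fields, which do not depend on column_orders.
    match (&stats.min, &stats.max) {
        (Some(lo), Some(hi)) => Some((lo.as_slice(), hi.as_slice())),
        _ => None,
    }
}
```

   For the file above (min and max are `None`, column_orders is `None`) this 
returns `None`, which matches the stats being ignored on the reading side.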


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.