pjmore commented on issue #799: URL: https://github.com/apache/arrow-rs/issues/799#issuecomment-1019666366
@tfiasco I think this is because the C++ implementation only uses min_value and max_value if a column order is set: https://github.com/apache/arrow/blob/54460d96ba1d613e472d8d9a96c072147e736b4d/cpp/src/parquet/metadata.cc#L82

For your example, with the current implementation (I used version 7), a modified version of your snippet prints None for column_orders:

```
println!(
    "statistics: {:?}",
    &arrow_reader
        .get_metadata()
        .row_group(0)
        .to_thrift()
        .columns[0]
        .meta_data.as_ref()
);
println!(
    "statistics: {:?}",
    &arrow_reader
        .get_metadata()
        .row_group(0)
        .to_thrift()
        .columns[1]
        .meta_data.as_ref()
);
println!(
    "column_orders: {:?}",
    &arrow_reader
        .get_metadata()
        .file_metadata()
        .column_orders()
);
```

This prints

```
statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 47, index_page_offset: None, dictionary_page_offset: Some(4), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([5, 0, 0, 0]), min_value: Some([1, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
statistics: Some(ColumnMetaData { type_: Int32, encodings: [Plain, RleDictionary, Rle], path_in_schema: ["id2"], codec: Zstd, num_values: 5, total_uncompressed_size: 70, total_compressed_size: 88, key_value_metadata: None, data_page_offset: 181, index_page_offset: None, dictionary_page_offset: Some(138), statistics: Some(Statistics { max: None, min: None, null_count: None, distinct_count: None, max_value: Some([6, 0, 0, 0]), min_value: Some([2, 0, 0, 0]) }), encoding_stats: None, bloom_filter_offset: None })
column_orders: None
```

So the statistics are written to min_value and max_value, but no column order is set, and it looks like the C++ implementation is simply ignoring the current stats values.
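To make the rule concrete, here is a minimal Rust sketch of the selection logic as I understand it from the linked C++ code. `Stats`, `effective_min_max`, and `decode_i32` are hypothetical names made up for illustration and are not part of the parquet crate. The fallback to the deprecated min/max fields is my reading of the C++ behaviour and should be checked against the source.

```rust
/// Hypothetical, simplified mirror of the Thrift `Statistics` struct printed above.
#[derive(Debug, Default)]
struct Stats {
    min: Option<Vec<u8>>,       // deprecated field
    max: Option<Vec<u8>>,       // deprecated field
    min_value: Option<Vec<u8>>, // only meaningful when a column order is set
    max_value: Option<Vec<u8>>, // only meaningful when a column order is set
}

/// Assumed reader rule: use min_value/max_value only when a column order
/// is set, otherwise fall back to the deprecated min/max fields.
fn effective_min_max(stats: &Stats, column_order_set: bool) -> (Option<&[u8]>, Option<&[u8]>) {
    if column_order_set {
        (stats.min_value.as_deref(), stats.max_value.as_deref())
    } else {
        (stats.min.as_deref(), stats.max.as_deref())
    }
}

/// Decodes an INT32 statistic stored as 4 little-endian bytes.
fn decode_i32(bytes: &[u8]) -> Option<i32> {
    if bytes.len() != 4 {
        return None;
    }
    Some(i32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

fn main() {
    // Same shape as the "id" column above: only min_value/max_value are populated.
    let stats = Stats {
        min_value: Some(vec![1, 0, 0, 0]),
        max_value: Some(vec![5, 0, 0, 0]),
        ..Default::default()
    };

    // column_orders is None in the file, so a reader following this rule sees no stats.
    let (min, max) = effective_min_max(&stats, false);
    println!("without column order: min={:?} max={:?}", min, max);

    // With a column order set, the same bytes would be picked up and decoded.
    let (min, max) = effective_min_max(&stats, true);
    println!(
        "with column order: min={:?} max={:?}",
        min.and_then(decode_i32),
        max.and_then(decode_i32)
    );
}
```

Under these assumptions, a file that populates only min_value/max_value and leaves column_orders unset appears to have no statistics at all to such a reader, which matches what was reported.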
While digging through the code to figure this out, I found a comment in parquet_format saying that without column_orders the meaning of min_value and max_value is undefined. If the comment is accurate, it seems like a bug in the current implementation that min_value and max_value are being used the way they are. The comment in question is:

```
/// Sort order used for the min_value and max_value fields of each column in
/// this file. Sort orders are listed in the order matching the columns in the
/// schema. The indexes are not necessary the same though, because only leaf
/// nodes of the schema are represented in the list of sort orders.
///
/// Without column_orders, the meaning of the min_value and max_value fields is
/// undefined. To ensure well-defined behaviour, if min_value and max_value are
/// written to a Parquet file, column_orders must be written as well.
///
/// The obsolete min and max fields are always sorted by signed comparison
/// regardless of column_orders.
```

https://github.com/sunchao/parquet-format-rs/blob/b0d5bcb51a919837310c7dccd5141ea956346357/src/parquet_format.rs#L4919
