Jefffrey commented on issue #5270: URL: https://github.com/apache/arrow-rs/issues/5270#issuecomment-1874687163
I think this is actually due to https://github.com/apache/arrow-rs/pull/5158 only having been merged recently. Some reproduction code:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

fn main() -> Result<()> {
    println!("checking file from issue");
    let no_stats_path = "/home/jeffrey/Downloads/no_stats.parquet";
    let file = File::open(no_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("checking file after rewritten by pyarrow");
    let with_stats_path = "/home/jeffrey/Downloads/with_stats.parquet";
    let file = File::open(with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("rewriting file from issue with latest parquet-rs");
    let file = File::open(no_stats_path)?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let batch = reader.next().unwrap()?;
    let new_with_stats_path = "/home/jeffrey/Downloads/new_with_stats.parquet";
    let file = File::create(new_with_stats_path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    println!("checking file after rewritten by latest parquet-rs");
    let file = File::open(new_with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    Ok(())
}
```

And the output:

```
checking file from issue
[parquet/./examples/read_parquet.rs:30] reader.metadata().file_metadata().column_order(0) = UNDEFINED
checking file after rewritten by pyarrow
[parquet/./examples/read_parquet.rs:36] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
    UNSIGNED,
)
rewriting file from issue with latest parquet-rs
checking file after rewritten by latest parquet-rs
[parquet/./examples/read_parquet.rs:52] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
    UNSIGNED,
)
```

We can see that the original `no_stats.parquet` (from https://github.com/apache/arrow-rs/issues/5270#issuecomment-1873943429) has an undefined column order. After following the steps to rewrite that file using pyarrow, we can see the column order is now populated. Rewriting the original file with the latest master of parquet-rs, as was done with pyarrow, also produces a defined column order. And when I check this new parquet file, written by the latest parquet-rs master branch, with pyarrow, the statistics now come through:

```python
>>> pq.ParquetFile("/home/jeffrey/Downloads/new_with_stats.parquet").metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f2bf82d92b0>
  file_offset: 61
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 1
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f2bf82d94e0>
      has_min_max: True
      min: 01
      max: 01
      null_count: None
      distinct_count: None
      num_values: 1
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: UNCOMPRESSED
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 24
  total_compressed_size: 57
  total_uncompressed_size: 57
>>>
```

Could you test again from the arrow-rs master branch, to see if this resolves the issue?
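For completeness, the same check can also be done from parquet-rs itself rather than pyarrow. A minimal sketch (assuming the `new_with_stats.parquet` file produced by the reproduction above) that reads the column chunk statistics of the rewritten file:

```rust
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

fn main() -> Result<()> {
    // Path to the file written by the reproduction code above (adjust as needed).
    let path = "/home/jeffrey/Downloads/new_with_stats.parquet";
    let reader = SerializedFileReader::new(File::open(path)?)?;

    // Column chunk metadata for the first column of the first row group;
    // `statistics()` should now return `Some(..)` with min/max populated.
    let column = reader.metadata().row_group(0).column(0);
    dbg!(column.statistics());

    Ok(())
}
```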
This fix should come in arrow-rs release 50.0.0 (it is not included in 49.0.0, which is the current latest on crates.io); see the release tracking issue: https://github.com/apache/arrow-rs/issues/5234
