Jefffrey commented on issue #5270:
URL: https://github.com/apache/arrow-rs/issues/5270#issuecomment-1874687163

   I think this is actually because https://github.com/apache/arrow-rs/pull/5158
was only merged recently.
   
   Some reproduction code:
   
   ```rust
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::ArrowWriter;
   use parquet::errors::Result;
   use parquet::file::reader::FileReader;
   use parquet::file::serialized_reader::SerializedFileReader;
   use std::fs::File;
   
   fn main() -> Result<()> {
       println!("checking file from issue");
       let no_stats_path = "/home/jeffrey/Downloads/no_stats.parquet";
       let file = File::open(no_stats_path)?;
       let reader = SerializedFileReader::new(file)?;
       dbg!(reader.metadata().file_metadata().column_order(0));
   
       println!("checking file after rewritten by pyarrow");
       let with_stats_path = "/home/jeffrey/Downloads/with_stats.parquet";
       let file = File::open(with_stats_path)?;
       let reader = SerializedFileReader::new(file)?;
       dbg!(reader.metadata().file_metadata().column_order(0));
   
       println!("rewriting file from issue with latest parquet-rs");
       let file = File::open(no_stats_path)?;
       let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
       let batch = reader.next().unwrap()?;
   
       let new_with_stats_path = "/home/jeffrey/Downloads/new_with_stats.parquet";
       let file = File::create(new_with_stats_path)?;
       let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
       writer.write(&batch)?;
       writer.close()?;
   
       println!("checking file after rewritten by latest parquet-rs");
       let file = File::open(new_with_stats_path)?;
       let reader = SerializedFileReader::new(file)?;
       dbg!(reader.metadata().file_metadata().column_order(0));
   
       Ok(())
   }
   ```
   
   And the output:
   
   ```
   checking file from issue
   [parquet/./examples/read_parquet.rs:30] reader.metadata().file_metadata().column_order(0) = UNDEFINED
   checking file after rewritten by pyarrow
   [parquet/./examples/read_parquet.rs:36] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
       UNSIGNED,
   )
   rewriting file from issue with latest parquet-rs
   checking file after rewritten by latest parquet-rs
   [parquet/./examples/read_parquet.rs:52] reader.metadata().file_metadata().column_order(0) = TYPE_DEFINED_ORDER(
       UNSIGNED,
   )
   ```
   
   We can see that the original `no_stats.parquet` (from
https://github.com/apache/arrow-rs/issues/5270#issuecomment-1873943429) has an
undefined column order. After following the steps to rewrite that file using
pyarrow, we can see that the column order is now populated.
   
   Next, I ran the latest master of parquet-rs, rewriting the original file the
same way as was done with pyarrow, and the column order is now defined.
   
   When I check the new parquet file written by the latest parquet-rs master
branch with pyarrow, the statistics now come through:
   
   ```python
   >>> pq.ParquetFile("/home/jeffrey/Downloads/new_with_stats.parquet").metadata.row_group(0).column(0)
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f2bf82d92b0>
     file_offset: 61
     file_path:
     physical_type: BYTE_ARRAY
     num_values: 1
     path_in_schema: x
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f2bf82d94e0>
         has_min_max: True
         min: 01
         max: 01
         null_count: None
         distinct_count: None
         num_values: 1
         physical_type: BYTE_ARRAY
         logical_type: String
         converted_type (legacy): UTF8
     compression: UNCOMPRESSED
     encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 24
     total_compressed_size: 57
     total_uncompressed_size: 57
   >>>
   ```
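
   To illustrate why the column order matters here: as I understand the Parquet
format, the meaning of the min/max statistics is undefined when no column order
is recorded, so a reader may choose to drop them. Below is a minimal,
self-contained sketch of that reader-side rule. It is a simplified model, not
the actual pyarrow or parquet-rs code; the `ColumnOrder` enum and the
`usable_min_max` function are hypothetical names that only mirror the two
values seen in the output above.

   ```rust
   /// Hypothetical stand-in for the column order stored in a Parquet footer.
   /// This is NOT the parquet crate's type; it only models the two cases above.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum ColumnOrder {
       Undefined,
       TypeDefinedOrder,
   }

   /// Assumed reader-side rule: only expose min/max when the file states how
   /// values were ordered; otherwise treat the statistics as unusable.
   fn usable_min_max<'a>(
       order: ColumnOrder,
       min_max: Option<(&'a str, &'a str)>,
   ) -> Option<(&'a str, &'a str)> {
       match order {
           ColumnOrder::TypeDefinedOrder => min_max,
           ColumnOrder::Undefined => None,
       }
   }

   fn main() {
       // Same min/max values as in the pyarrow output above.
       let stats = Some(("01", "01"));

       // Like the original file: statistics present, but no column order.
       println!("{:?}", usable_min_max(ColumnOrder::Undefined, stats));

       // Like the rewritten files: column order present, statistics usable.
       println!("{:?}", usable_min_max(ColumnOrder::TypeDefinedOrder, stats));
   }
   ```

   Under this model the statistics bytes can be present in the file and still
not be surfaced by a reader, which matches the behaviour seen before and after
the rewrite.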
   
   Could you test again from the arrow-rs master branch, to see if this 
resolves the issue?
   
   This fix should come in arrow-rs release 50.0.0 (it is not included in
49.0.0, which is the current latest on crates.io); see the tracking issue for
the release: https://github.com/apache/arrow-rs/issues/5234

