tfiasco opened a new issue #799:
URL: https://github.com/apache/arrow-rs/issues/799


   **Describe the bug**
   A Parquet file written by `arrow-rs` reports no min/max statistics when read with `pyarrow`.
   
   **To Reproduce**
   ```rust
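   // rust code (imports assumed for parquet/arrow 5.4.0)
   use std::{fs, sync::Arc};

   use arrow::array::Int32Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::{ArrowWriter, ParquetFileArrowReader};
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;
   use parquet::file::reader::SerializedFileReader;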
   
   let id_array = Int32Array::from(vec![1, 2, 3, 4, 5]);
   let id_array2 = Int32Array::from(vec![2, 3, 4, 5, 6]);
   let schema = Arc::new(Schema::new(vec![
       Field::new("id", DataType::Int32, false),
       Field::new("id2", DataType::Int32, false),
   ]));
   
   let batch = RecordBatch::try_new(
       schema.clone(),
       vec![Arc::new(id_array), Arc::new(id_array2)],
   )
   .unwrap();
   
   let writer_properties = WriterProperties::builder()
       .set_compression(Compression::ZSTD)
       .set_statistics_enabled(true)
       .build();
   
   let path = "/.../test.parquet";
   let file = fs::File::create(&path).unwrap();
   
   let mut writer =
       ArrowWriter::try_new(file, schema.clone(), Some(writer_properties)).unwrap();
   writer.write(&batch).unwrap();
   writer.close().unwrap();
   
   let file2 = fs::File::open(&path).unwrap();
   
   let file_reader = SerializedFileReader::new(file2).unwrap();
   let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
   
   println!(
       "statistics: {:?}",
       arrow_reader
           .get_metadata()
           .row_group(0)
           .column(0)
           .statistics()
   );
   println!(
       "statistics: {:?}",
       arrow_reader
           .get_metadata()
           .row_group(0)
           .column(1)
           .statistics()
   );
   
   // output:
   // statistics: Some(Int32({min: Some(1), max: Some(5), distinct_count: None, null_count: 0, min_max_deprecated: false}))
   // statistics: Some(Int32({min: Some(2), max: Some(6), distinct_count: None, null_count: 0, min_max_deprecated: false}))
   ```
   
   ```python
   # python code
   
   import pyarrow.parquet as pq
   f = pq.ParquetFile('./test.parquet')
   print(f.metadata.row_group(0).column(0).statistics)
   
   # output:
   """
   <pyarrow._parquet.Statistics object at 0x7fbf8d409dd0>
     has_min_max: False
     min: None
     max: None
     null_count: 0
     distinct_count: 0
     num_values: 5
     physical_type: INT32
     logical_type: None
     converted_type (legacy): NONE
   """
   ```
   
   **Expected behavior**
   `pyarrow` should report the same min/max statistics that the Rust reader shows, for example:
   ```
     has_min_max: True
     min: 1
     max: 5
   ```
   
   **Additional context**
   Rust library versions:
   ```
   parquet = "5.4.0"
   arrow = "5.4.0"
   ```
   
   Python library version:
   ```
   pyarrow==5.0.0
   ```
   



