alamb opened a new issue #641:
URL: https://github.com/apache/arrow-rs/issues/641


   **Describe the bug**
   The min/max statistics written by the arrow / parquet writer for String columns seem to be incorrect.
   
   **To Reproduce**
   Run this code:
   
   ```rust
   
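   use std::fs::File;
   use std::sync::Arc;

   use arrow::array::StringArray;
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;

   fn main() {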
       let input = vec![
           Some("andover"),
           Some("reading"),
           Some("bedford"),
           Some("tewsbury"),
           Some("lexington"),
           Some("lawrence"),
       ];
   
       let input: StringArray = input.into_iter().collect();
       println!("Starting to test with array {:?}", input);
   
       let record_batch = RecordBatch::try_from_iter(vec![
           ("city", Arc::new(input) as _)
       ]).unwrap();
   
       println!("Opening output file /tmp/test.parquet");
       let out_file = File::create("/tmp/test.parquet").unwrap();
   
       println!("Creating writer...");
       let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
           .expect("creating writer");
   
       println!("writing...");
       writer.write(&record_batch).expect("writing");
   
       println!("closing...");
       writer.close().expect("closing");
   
       println!("done...");
   }
   ```
   
   Then examine the resulting parquet file and note that the min/max values for the "city" column are:
   ```
   min: "andover"
   max: "lexington"
   ```
   
   ```shell
   alamb@MacBook-Pro rust_parquet % parquet-tools dump  /tmp/test.parquet 
   parquet-tools dump  /tmp/test.parquet 
   row group 0 
   
------------------------------------------------------------------------------------------------------------------------------
   city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...
   
       city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
       
--------------------------------------------------------------------------------------------------------------------------
       page 0:                  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6
   
   BINARY city 
   
------------------------------------------------------------------------------------------------------------------------------
   *** row group 1 of 1, values 1 to 6 *** 
   value 1: R:0 D:0 V:andover
   value 2: R:0 D:0 V:reading
   value 3: R:0 D:0 V:bedford
   value 4: R:0 D:0 V:tewsbury
   value 5: R:0 D:0 V:lexington
   value 6: R:0 D:0 V:lawrence
   ```
   
   **Expected behavior**
   The parquet file produced should have these min/max statistics for the "city" column:
   ```
   min: "andover"
   max: "tewsbury"
   ```
   
   As 't' follows 'l', "tewsbury" sorts after "lexington" and should be the max.
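
   As a quick sanity check of the expected values, here is a minimal sketch that computes the min/max of the same inputs with plain Rust string comparison. It assumes that byte-wise lexicographic order (what `Ord` for `&str` gives) is the ordering intended for the column statistics; `expected_min_max` is a hypothetical helper, not part of arrow or parquet.

   ```rust
   /// Hypothetical helper: returns the (min, max) of the given strings using
   /// Rust's built-in `&str` ordering (byte-wise lexicographic).
   fn expected_min_max<'a>(values: &[&'a str]) -> Option<(&'a str, &'a str)> {
       let min = values.iter().copied().min()?;
       let max = values.iter().copied().max()?;
       Some((min, max))
   }

   fn main() {
       // Same values as in the reproducer above
       let cities = [
           "andover", "reading", "bedford", "tewsbury", "lexington", "lawrence",
       ];
       let (min, max) = expected_min_max(&cities).expect("non-empty input");
       println!("min: {:?}", min);
       println!("max: {:?}", max);
   }
   ```

   Under that assumption the max comes out as "tewsbury", not "lexington".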
   
   **Additional context**
   
   Since DataFusion now uses these statistics to prune row groups, this leads to incorrect results in DataFusion. I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153
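
   To illustrate why a too-small max gives wrong results, here is a rough sketch of min/max pruning for an equality predicate. This is not DataFusion's actual pruning code; `ColumnStats` and `may_contain` are hypothetical names used only for illustration.

   ```rust
   /// Hypothetical stand-in for the min/max statistics of one row group.
   struct ColumnStats<'a> {
       min: &'a str,
       max: &'a str,
   }

   /// Returns true if a row group with these statistics could contain `value`.
   /// A pruning reader skips the row group when this returns false.
   fn may_contain(stats: &ColumnStats<'_>, value: &str) -> bool {
       stats.min <= value && value <= stats.max
   }

   fn main() {
       // Statistics as written to the file in this report
       let written = ColumnStats { min: "andover", max: "lexington" };
       // Statistics that the data actually implies
       let expected = ColumnStats { min: "andover", max: "tewsbury" };

       for city in ["reading", "tewsbury"] {
           println!(
               "city = {:?}: written stats keep row group: {}, expected stats keep row group: {}",
               city,
               may_contain(&written, city),
               may_contain(&expected, city),
           );
       }
   }
   ```

   With the written statistics, a query such as `city = 'reading'` would skip the row group even though the value is present in it.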

