alamb opened a new issue #641: URL: https://github.com/apache/arrow-rs/issues/641
**Describe the bug**

The min/max statistics written by the arrow / parquet writer for String columns appear to be incorrect.

**To Reproduce**

Run this code:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() {
    let input = vec![
        Some("andover"),
        Some("reading"),
        Some("bedford"),
        Some("tewsbury"),
        Some("lexington"),
        Some("lawrence"),
    ];

    let input: StringArray = input.into_iter().collect();
    println!("Starting to test with array {:?}", input);

    // Single-column batch; the column is named "city"
    let record_batch = RecordBatch::try_from_iter(vec![
        ("city", Arc::new(input) as _)
    ]).unwrap();

    println!("Opening output file /tmp/test.parquet");
    let out_file = File::create("/tmp/test.parquet").unwrap();

    println!("Creating writer...");
    // `None` means default writer properties
    let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), None)
        .expect("creating writer");

    println!("writing...");
    writer.write(&record_batch).expect("writing");

    println!("closing...");
    writer.close().expect("closing");

    println!("done...");
}
```

Then examine the resulting Parquet file and note that the min/max values recorded for the "city" column are:

```
min: "andover"
max: "lexington"
```

```shell
alamb@MacBook-Pro rust_parquet % parquet-tools dump /tmp/test.parquet
parquet-tools dump /tmp/test.parquet
row group 0
------------------------------------------------------------------------------------------------------------------------------
city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...

    city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
    --------------------------------------------------------------------------------------------------------------------------
    page 0:  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: andover, max: lexington, num_nulls not defined] [more]... VC:6

BINARY city
------------------------------------------------------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 6 ***
value 1: R:0 D:0 V:andover
value 2: R:0 D:0 V:reading
value 3: R:0 D:0 V:bedford
value 4: R:0 D:0 V:tewsbury
value 5: R:0 D:0 V:lexington
value 6: R:0 D:0 V:lawrence
```

**Expected behavior**

The Parquet file produced should have these min/max statistics for the city column:

```
min: "andover"
max: "tewsbury"
```

The max should be "tewsbury" rather than "lexington" because 't' sorts after 'l'.

**Additional context**

DataFusion now uses these statistics to prune row groups, so the incorrect max leads to incorrect query results in DataFusion: a row group whose recorded max is too small can be skipped even though it contains matching rows.

I found this when investigating https://github.com/influxdata/influxdb_iox/issues/2153
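
To make the expected values and the pruning impact concrete, here is a minimal standard-library sketch. It is not the writer's or DataFusion's actual code: `min_max` and `may_contain` are hypothetical helpers, and the sketch assumes string statistics are compared byte-wise, which is how Rust orders `&str`.

```rust
/// Hypothetical helper: byte-wise lexicographic min and max of the values,
/// i.e. what the column statistics are expected to hold.
fn min_max<'a>(values: &[&'a str]) -> Option<(&'a str, &'a str)> {
    let min = values.iter().copied().min()?;
    let max = values.iter().copied().max()?;
    Some((min, max))
}

/// Hypothetical pruning check for a predicate `city = target`:
/// a row group can only be skipped when `target` lies outside [min, max].
fn may_contain(min: &str, max: &str, target: &str) -> bool {
    min <= target && target <= max
}

fn main() {
    let cities = [
        "andover", "reading", "bedford", "tewsbury", "lexington", "lawrence",
    ];

    let (min, max) = min_max(&cities).expect("non-empty input");
    println!("expected min: {:?}, max: {:?}", min, max);

    // With correct statistics the row group is kept for `city = 'tewsbury'`.
    println!("kept with expected stats: {}", may_contain(min, max, "tewsbury"));

    // With the statistics reported in the file (max = "lexington") the row
    // group would be skipped, even though it contains "tewsbury".
    println!(
        "kept with reported stats: {}",
        may_contain("andover", "lexington", "tewsbury")
    );
}
```

Under these assumptions, any equality or range predicate on a value that sorts after "lexington" (here "reading" and "tewsbury") would wrongly skip the row group when the reported statistics are used.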