alamb opened a new issue, #7579:
URL: https://github.com/apache/arrow-rs/issues/7579

   **Describe the bug**
   - As @jonded94 found in #7489
   - And @etseidl debugged in https://github.com/apache/arrow-rs/pull/7555
   
   When writing long string values into string columns in Parquet, we expect
`WriterProperties::max_statistics_truncate_length` to be applied and to reduce
the size of the resulting statistics.
   
   This property currently truncates the statistics written to the
ColumnChunkMetadata correctly, but *NOT* the statistics written to the data page headers.
   
   **To Reproduce**
   ```rust
   use std::io::BufWriter;
   use std::sync::Arc;
   use arrow::array::{ArrayRef, RecordBatch, StringViewArray};
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   
   fn main() {
   
       let output = std::fs::File::create("output.parquet").unwrap();
       let mut output = BufWriter::new(output);
   
       let batch = make_batch('a');
       let props = WriterProperties::builder()
           .set_max_row_group_size(1)
           .set_statistics_truncate_length(Some(64))
           .build();
   
       let mut writer =
           ArrowWriter::try_new(&mut output, batch.schema(), Some(props)).unwrap();
       writer.write(&batch).unwrap();
   
       for char in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] {
           let batch = make_batch(char);
           writer.write(&batch).unwrap();
       }
       writer.close().unwrap();
   }
   
   // Makes a batch with long string values for testing purposes.
   fn make_batch(val: char) -> RecordBatch {
       let col = Arc::new(StringViewArray::from_iter_values(
           [val.to_string().repeat(100000)]
       )) as ArrayRef;
       RecordBatch::try_from_iter([("col", col)]).unwrap()
   }
   ```
   
   The resulting data page headers contain statistics that are not truncated.
   
   **Expected behavior**
   I expect the statistics in the data page headers to be truncated to 64 bytes
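
To make "truncated to 64 bytes" concrete, here is a minimal standalone sketch of what truncating a min and a max string statistic could look like. It uses only the standard library. The helper names `truncate_min` and `truncate_max` are hypothetical and are not the parquet crate's API; the crate's actual implementation may differ. The idea is that a truncated min can simply be a prefix (it still sorts at or below the original), while a truncated max needs its last character incremented so that it still sorts at or above the original.

```rust
/// Hypothetical helper (not part of the parquet crate): truncate `value` to at
/// most `max_len` bytes without splitting a UTF-8 character. The result is a
/// prefix of `value`, so it is still a valid lower bound for a min statistic.
fn truncate_min(value: &str, max_len: usize) -> &str {
    if value.len() <= max_len {
        return value;
    }
    // Walk back to the nearest character boundary at or before `max_len`.
    let mut end = max_len;
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}

/// Hypothetical helper (not part of the parquet crate): truncate `value` to at
/// most `max_len` bytes, then increment the last character that can be bumped
/// without growing its encoding, so the result still sorts at or above the
/// original. Returns `None` if no such shorter upper bound exists.
fn truncate_max(value: &str, max_len: usize) -> Option<String> {
    if value.len() <= max_len {
        return Some(value.to_string());
    }
    let mut chars: Vec<char> = truncate_min(value, max_len).chars().collect();
    while let Some(last) = chars.pop() {
        // `from_u32` rejects surrogates and values past char::MAX.
        if let Some(next) = char::from_u32(last as u32 + 1) {
            // Keep the byte budget: only accept a successor of the same width.
            if next.len_utf8() == last.len_utf8() {
                chars.push(next);
                return Some(chars.into_iter().collect());
            }
        }
        // Otherwise drop this character and try to bump the previous one.
    }
    None
}
```

With the 100000-byte values of `'a'` from the reproducer and a limit of 64, this sketch would yield a 64-byte min of all `'a'` and a 64-byte max ending in `'b'`, which is the kind of bounded size expected in the data page headers.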
   
   **Additional context**
   - https://github.com/apache/arrow-rs/issues/7490

