tustvold commented on code in PR #2854: URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990798748
########## parquet/src/arrow/arrow_writer/mod.rs: ########## @@ -1108,6 +1110,55 @@ mod tests { roundtrip(batch, Some(SMALL_SIZE / 2)); } + #[test] + fn arrow_writer_page_size() { + let mut rng = thread_rng(); + let schema = + Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)])); + + let mut builder = StringBuilder::with_capacity(1_000, 2 * 1_000); + + for _ in 0..10_000 { + let value = (0..200) + .map(|_| rng.gen_range(b'a'..=b'z') as char) + .collect::<String>(); + + builder.append_value(value); + } + + let array = Arc::new(builder.finish()); + + let batch = RecordBatch::try_new(schema, vec![array]).unwrap(); + + let file = tempfile::tempfile().unwrap(); + + let props = WriterProperties::builder() + .set_max_row_group_size(usize::MAX) + .set_data_pagesize_limit(256) Review Comment: You could potentially set the dictionary page size smaller to verify that as well, but up to you ########## parquet/src/arrow/arrow_writer/mod.rs: ########## @@ -1108,6 +1110,55 @@ mod tests { roundtrip(batch, Some(SMALL_SIZE / 2)); } + #[test] + fn arrow_writer_page_size() { + let mut rng = thread_rng(); Review Comment: I think we should either seed this, or loosen the assert below. Otherwise I worry that depending on what values are generated, we may end up with more or less pages (as the dictionary page will only spill once it has seen sufficient different values, which technically could occur at any point) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org