[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

GitBox Sun, 09 Oct 2022 07:52:52 -0700


tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990798748



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, 
false)]));
+
+        let mut builder = StringBuilder::with_capacity(1_000, 2 * 1_000);
+
+        for _ in 0..10_000 {
+            let value = (0..200)
+                .map(|_| rng.gen_range(b'a'..=b'z') as char)
+                .collect::<String>();
+
+            builder.append_value(value);
+        }
+
+        let array = Arc::new(builder.finish());
+
+        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let file = tempfile::tempfile().unwrap();
+
+        let props = WriterProperties::builder()
+            .set_max_row_group_size(usize::MAX)
+            .set_data_pagesize_limit(256)

Review Comment:
   You could potentially set the dictionary page size smaller to verify that as 
well, but up to you



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();

Review Comment:
   I think we should either seed this, or loosen the assert below. Otherwise I 
worry that depending on what values are generated, we may end up with more or 
less pages (as the dictionary page will only spill once it has seen sufficient 
different values, which technically could occur at any point)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Reply via email to