Re: [PR] fix: parallel parquet can underflow when max_record_batch_rows < execution.batch_size [arrow-datafusion]

via GitHub Fri, 22 Mar 2024 11:36:18 -0700


alamb commented on code in PR #9737:
URL: https://github.com/apache/arrow-datafusion/pull/9737#discussion_r1536022590



##########
datafusion/core/src/dataframe/parquet.rs:
##########
@@ -166,4 +166,47 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn write_parquet_with_small_rg_size() -> Result<()> {
+        let mut test_df = test_util::test_table().await?;
+        // make the test data larger so there are multiple batches
+        for _ in 0..7 {
+            test_df = test_df.clone().union(test_df)?;
+        }
+        let output_path = "file://local/test.parquet";
+
+        for rg_size in (1..7).step_by(5) {
+            let df = test_df.clone();
+            let tmp_dir = TempDir::new()?;
+            let local = Arc::new(LocalFileSystem::new_with_prefix(&tmp_dir)?);
+            let local_url = Url::parse("file://local").unwrap();
+            let ctx = &test_df.session_state;
+            ctx.runtime_env().register_object_store(&local_url, local);
+            let mut options = TableParquetOptions::default();
+            options.global.max_row_group_size = rg_size;
+            options.global.allow_single_file_parallelism = true;
+            df.write_parquet(
+                output_path,
+                DataFrameWriteOptions::new().with_single_file_output(true),
+                Some(options),
+            )
+            .await?;
+
+            // Check that file actually used the correct rg size
+            let file = std::fs::File::open(tmp_dir.into_path().join("test.parquet"))?;

Review Comment:
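   I think calling `into_path` here means the temporary directory (and the file in it) won't be cleaned up, because `into_path` consumes the `TempDir` and persists it on disk.
   
   I think calling `path()` instead would ensure the directory is removed when `tmp_dir` is dropped.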
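
To illustrate the difference between borrowing and consuming a cleanup guard, here is a minimal std-only sketch. `ScratchDir` is a hypothetical stand-in written for this example; it is not the `tempfile` crate's implementation and only mimics the borrow-versus-consume behavior described above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard type (an assumption for illustration, not `tempfile::TempDir`).
/// It removes its directory on drop unless the path has been taken out.
struct ScratchDir {
    path: Option<PathBuf>,
}

impl ScratchDir {
    fn new(name: &str) -> io::Result<Self> {
        let path = std::env::temp_dir().join(format!("{}-{}", name, std::process::id()));
        fs::create_dir_all(&path)?;
        Ok(Self { path: Some(path) })
    }

    /// Borrows the path; the guard keeps ownership and still cleans up on drop.
    fn path(&self) -> &Path {
        self.path.as_deref().expect("path is present until into_path is called")
    }

    /// Consumes the guard and hands the path to the caller, disabling cleanup.
    fn into_path(mut self) -> PathBuf {
        self.path.take().expect("path is present until into_path is called")
    }
}

impl Drop for ScratchDir {
    fn drop(&mut self) {
        // Only delete when the guard still owns the directory.
        if let Some(path) = self.path.take() {
            let _ = fs::remove_dir_all(path);
        }
    }
}

/// Returns true if the directory is gone after the guard is dropped.
fn borrowed_is_cleaned() -> io::Result<bool> {
    let dir = ScratchDir::new("scratch-borrowed")?;
    let kept = dir.path().to_path_buf();
    fs::write(kept.join("test.parquet"), b"x")?;
    drop(dir);
    Ok(!kept.exists())
}

/// Returns true if the directory is still on disk after `into_path`.
fn consumed_is_left() -> io::Result<bool> {
    let dir = ScratchDir::new("scratch-consumed")?;
    let kept = dir.into_path();
    let still_there = kept.exists();
    // Manual cleanup is now the caller's job.
    fs::remove_dir_all(&kept)?;
    Ok(still_there)
}

fn main() -> io::Result<()> {
    println!("borrowed path cleaned up: {}", borrowed_is_cleaned()?);
    println!("consumed path left behind: {}", consumed_is_left()?);
    Ok(())
}
```

In the test under review, the analogous change would be joining the file name onto the borrowed path rather than the consumed one, so the guard stays alive until the end of the loop body.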



##########
datafusion/core/src/dataframe/parquet.rs:
##########
@@ -166,4 +166,47 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn write_parquet_with_small_rg_size() -> Result<()> {
+        let mut test_df = test_util::test_table().await?;
+        // make the test data larger so there are multiple batches
+        for _ in 0..7 {
+            test_df = test_df.clone().union(test_df)?;
+        }
+        let output_path = "file://local/test.parquet";
+
+        for rg_size in (1..7).step_by(5) {

Review Comment:
   My reading of the [docs](https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.step_by) and my [playground experiments](https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=2ff81b06e8447214338607095c726d7c) suggest this is the same as `[1, 6]` -- is that the intent? Or did you mean `1, 5, 10, 15, 20, 25, 30, 35`?
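
A small std-only sketch of the two readings, in case it helps settle the intent. The function names are made up for this example, and the second iterator is just one possible way to produce the alternative sequence, not necessarily what the PR author had in mind.

```rust
/// The loop bounds as written in the test: start at 1, step by 5, stop before 7.
fn sizes_as_written() -> Vec<usize> {
    (1..7).step_by(5).collect()
}

/// One way to get 1 followed by the multiples of 5 up to 35.
fn sizes_multiples_of_five() -> Vec<usize> {
    std::iter::once(1).chain((5..=35).step_by(5)).collect()
}

fn main() {
    println!("{:?}", sizes_as_written());
    println!("{:?}", sizes_multiples_of_five());
}
```

Note that `step_by` always yields the first element and then every n-th element after it, so a range starting at 1 with a step of 5 can never land on 5 or 10.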



##########
datafusion/core/src/dataframe/parquet.rs:
##########
@@ -166,4 +166,47 @@ mod tests {
 
         Ok(())
     }
+
+    #[tokio::test]
+    async fn write_parquet_with_small_rg_size() -> Result<()> {
+        let mut test_df = test_util::test_table().await?;
+        // make the test data larger so there are multiple batches
+        for _ in 0..7 {
+            test_df = test_df.clone().union(test_df)?;

Review Comment:
   When I ran this test it took more than 22 seconds on my laptop. I wonder if we really need to generate so much data -- maybe we can try slicing up the batch (or else maybe use larger rg_sizes)
   
   ```shell
   $ cargo test --lib -p datafusion -- write_parquet_with_small_rg_size
   ...
       Finished test [unoptimized + debuginfo] target(s) in 0.16s
        Running unittests src/lib.rs (target/debug/deps/datafusion-4cbfc61ad6017be4)
   
   running 1 test
   test dataframe::parquet::tests::write_parquet_with_small_rg_size ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 651 filtered out; finished in 22.31s
   
   ```
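
For a rough sense of scale, here is a back-of-the-envelope sketch of how many row groups the test might write. It assumes each `union` of the DataFrame with itself doubles the row count, and it uses a base of 100 rows purely as a placeholder; the actual size of `test_table()` is not stated in this thread.

```rust
/// Rows after repeatedly unioning a table with itself (assumes each union doubles the rows).
fn rows_after_unions(base_rows: usize, unions: u32) -> usize {
    base_rows * 2usize.pow(unions)
}

/// Number of row groups needed when each group holds at most `rg_size` rows.
fn row_groups(total_rows: usize, rg_size: usize) -> usize {
    (total_rows + rg_size - 1) / rg_size
}

fn main() {
    // Placeholder base row count; an assumption, not taken from the PR.
    let base_rows = 100;
    let total = rows_after_unions(base_rows, 7);
    for rg_size in [1usize, 6] {
        println!(
            "rg_size = {rg_size}: {total} rows -> {} row groups",
            row_groups(total, rg_size)
        );
    }
}
```

Under these assumptions, a row group size of 1 means one row group per row, which would be consistent with the suggestion to use less data or larger `rg_size` values.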



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

