devinjdangelo commented on code in PR #7655:
URL: https://github.com/apache/arrow-datafusion/pull/7655#discussion_r1371703274


##########
datafusion/common/src/config.rs:
##########
@@ -377,12 +377,24 @@ config_namespace! {
         pub bloom_filter_ndv: Option<u64>, default = None
 
         /// Controls whether DataFusion will attempt to speed up writing
-        /// large parquet files by first writing multiple smaller files
-        /// and then stitching them together into a single large file.
-        /// This will result in faster write speeds, but higher memory usage.
-        /// Also currently unsupported are bloom filters and column indexes
-        /// when single_file_parallelism is enabled.
-        pub allow_single_file_parallelism: bool, default = false
+        /// parquet files by serializing them in parallel. Each column
+        /// in each row group in each output file is serialized in parallel,
+        /// leveraging a maximum possible core count of
+        /// n_files*n_row_groups*n_columns.
+        pub allow_single_file_parallelism: bool, default = true
+
+        /// If allow_single_file_parallelism=true, this setting applies
+        /// backpressure to limit how many row groups are serialized in
+        /// parallel, preventing OOM errors when memory is limited or I/O
+        /// is slow. Lowering this number limits memory growth at the cost
+        /// of potentially slower write speeds.
+        pub maximum_parallel_row_group_writers: usize, default = 16
+
+        /// If allow_single_file_parallelism=true, this setting applies
+        /// backpressure to prevent too many RecordBatches from building
+        /// up in memory when the parallel writers cannot consume them
+        /// fast enough. Lowering this number limits memory growth at the
+        /// cost of potentially lower write speeds.
+        pub maximum_buffered_record_batches_per_stream: usize, default = 200

Review Comment:
   I haven't added anything special to track with a MemoryReservation. It is
just a bounded channel size. Memory usage can't grow without bound, but it can
grow up to the in-memory size of 128 RecordBatches. So with a particularly
wide table or an extra large batch_size setting, I could see it climbing into
gigabytes of memory. If we are concerned about that, we could use buffer_size=2
as the default and leave it up to users to decide whether increasing the buffer
size is worth the memory/performance tradeoff.
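   For concreteness, here is roughly the shape of the bounded-channel
backpressure I am describing (a simplified sketch, not the code in this PR;
the capacity of 2, the single-column schema, and the tokio/arrow dependencies
are just placeholders for illustration):

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity plays the role of maximum_buffered_record_batches_per_stream:
    // at most this many batches can sit in the channel before senders block.
    let (tx, mut rx) = mpsc::channel::<RecordBatch>(2);

    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)]));

    let producer = tokio::spawn(async move {
        for i in 0..10i64 {
            let batch = RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(Int64Array::from(vec![i]))],
            )
            .unwrap();
            // send() awaits once the channel is full, so unconsumed batches
            // never pile up beyond the configured capacity (backpressure).
            tx.send(batch).await.unwrap();
        }
    });

    // Stand-in for a (possibly slow) parallel column/row group serializer.
    while let Some(batch) = rx.recv().await {
        println!("serializing batch with {} rows", batch.num_rows());
    }

    producer.await.unwrap();
}
```

   In the real writer the receiving side fans out to the per-column
serializers, but the memory bound comes from the same channel capacity.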
   
   It is also true that the numbers above come from a fairly extreme case of
writing ~250 million rows to a single parquet file, and you could instead just
write 2 or more files in parallel to close the performance gap. For a more
reasonably sized file of ~50 million rows and 1.5 GB the gap is smaller, but
it is still there:
   
   ### Parallel Parquet Writer, Varying Row Group Buffer Sizes
   | Metric | Buffer Size=2 | Buffer Size=64 | Buffer Size=128 |
   | -- | -- | -- | -- |
   | Execution Time (s) | 19.76 | 13.41 | 13.68 |
   | Peak Memory Usage (MB) | 272.2 | 277.2 | 281.8 |
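   If it helps to see the knobs in context, here is a sketch of how these
settings could be dialed down through the `ConfigOptions` struct this file
defines (the example values and the `datafusion.execution.parquet.*` key
prefix in the string-keyed call are my assumptions, not something this PR
changes):

```rust
use datafusion_common::config::ConfigOptions;

fn main() -> datafusion_common::Result<()> {
    let mut options = ConfigOptions::new();

    // Typed access to the fields added in this PR.
    options.execution.parquet.allow_single_file_parallelism = true;
    options.execution.parquet.maximum_parallel_row_group_writers = 4;
    options.execution.parquet.maximum_buffered_record_batches_per_stream = 2;

    // Equivalent string-keyed access, as a SQL `SET` statement would do
    // (assuming the usual `datafusion.execution.parquet` prefix).
    options.set(
        "datafusion.execution.parquet.maximum_buffered_record_batches_per_stream",
        "2",
    )?;

    assert_eq!(
        options
            .execution
            .parquet
            .maximum_buffered_record_batches_per_stream,
        2
    );
    Ok(())
}
```

   A buffer of 2 corresponds to the low-memory end of the tradeoff shown in
the table above.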


