devinjdangelo commented on code in PR #7655:
URL: https://github.com/apache/arrow-datafusion/pull/7655#discussion_r1370944242
##########
datafusion/common/src/config.rs:
##########
@@ -377,12 +377,24 @@ config_namespace! {
pub bloom_filter_ndv: Option<u64>, default = None
/// Controls whether DataFusion will attempt to speed up writing
- /// large parquet files by first writing multiple smaller files
- /// and then stitching them together into a single large file.
- /// This will result in faster write speeds, but higher memory usage.
- /// Also currently unsupported are bloom filters and column indexes
- /// when single_file_parallelism is enabled.
- pub allow_single_file_parallelism: bool, default = false
+ /// parquet files by serializing them in parallel. Each column
+ /// in each row group in each output file is serialized in parallel,
+ /// leveraging a maximum possible core count of
+ /// n_files*n_row_groups*n_columns.
+ pub allow_single_file_parallelism: bool, default = true
+
+ /// If allow_single_file_parallelism=true, this setting allows
+ /// applying backpressure to prevent too many row groups from being
+ /// serialized in parallel, which can cause OOM errors when memory is
+ /// limited or I/O is slow. Lowering this number limits memory growth
+ /// at the cost of potentially slower write speeds.
+ pub maximum_parallel_row_group_writers: usize, default = 16
+
+ /// If allow_single_file_parallelism=true, this setting allows
+ /// applying backpressure to prevent too many RecordBatches from
+ /// building up in memory when the parallel writers cannot consume
+ /// them fast enough. Lowering this number limits memory growth
+ /// at the cost of potentially lower write speeds.
+ pub maximum_buffered_record_batches_per_stream: usize, default = 200
Review Comment:
This one does have a significant impact on performance if lowered too far. I
spent some time testing and tuning the exact values. Setting
maximum_parallel_row_group_writers to 2 and
maximum_buffered_record_batches_per_stream to 128 allows two row groups to
serialize in parallel. If the buffer is set too low, backpressure kicks in
before a second row group can be spawned, and everything waits on just one row
group to finish writing.
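For reference, here is a minimal sketch of how these values could be set
programmatically, assuming the options are exposed under the field paths shown
in this diff (the `SessionContext` constructor name varies by DataFusion
version; this is illustrative, not code from this PR):
```rust
use datafusion::prelude::*;

// Hypothetical tuning sketch: cap memory growth while still letting two
// row groups serialize in parallel, per the numbers discussed above.
// Field paths mirror the config keys added in this diff.
fn tuned_context() -> SessionContext {
    let mut config = SessionConfig::new();
    let parquet = &mut config.options_mut().execution.parquet;
    parquet.allow_single_file_parallelism = true;
    parquet.maximum_parallel_row_group_writers = 2;
    parquet.maximum_buffered_record_batches_per_stream = 128;
    SessionContext::new_with_config(config)
}
```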
```
            ┌─────────────────┐         ┌────────────────────────┐
ExecPlan───►│ RowGroup1 Queue ├────────►│Parallel Col Serializers│
   │        └─────────────────┘         └────────────────────────┘
   │
   │        ┌─────────────────┐         ┌────────────────────────┐
   └───────►│ RowGroup2 Queue ├────────►│Parallel Col Serializers│
            └─────────────────┘         └────────────────────────┘

 Once max_rowgroup_rows are
 sent to RowGroup1 Queue,
 spawn a new Queue with
 its own parallel writers
```
The RowGroup2 Queue won't be created until the RowGroup1 Queue has received
the desired number of rows. The goal is to have two row groups serializing in
parallel whenever RecordBatches are produced fast enough. For a streaming plan
reading from disk, we probably never need more than two in parallel. When
writing data that is already in memory on a system with many cores, it is
highly beneficial to raise these limits further, so that an arbitrarily large
number of row groups can serialize in parallel.
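To make the queue-spawning behavior concrete, here is a toy sketch of the
pattern using bounded channels (my illustration with stand-in types, not this
PR's code): each row group queue is a bounded channel whose capacity plays the
role of maximum_buffered_record_batches_per_stream, and a new queue with its
own writer task is spawned once the current one has received
max_rowgroup_rows worth of data.
```rust
use tokio::sync::mpsc;

type Batch = Vec<u64>; // stand-in for an Arrow RecordBatch

// Drain one row group's queue; real code would fan out to parallel
// per-column serializers here instead of just counting rows.
async fn serialize_row_group(id: usize, mut rx: mpsc::Receiver<Batch>) {
    let mut rows = 0;
    while let Some(batch) = rx.recv().await {
        rows += batch.len();
    }
    println!("row group {id} finished with {rows} rows");
}

#[tokio::main]
async fn main() {
    let max_rowgroup_rows = 1_000;
    let queue_capacity = 128; // cf. maximum_buffered_record_batches_per_stream

    let (mut tx, rx) = mpsc::channel::<Batch>(queue_capacity);
    let mut writers = vec![tokio::spawn(serialize_row_group(0, rx))];
    let mut rows_in_current = 0;

    // The "ExecPlan" side: produce batches and demux them into queues.
    for i in 0..1_000u64 {
        let batch: Batch = vec![i; 8];
        rows_in_current += batch.len();
        // If the queue is full, this await stalls the producer: backpressure.
        tx.send(batch).await.unwrap();
        if rows_in_current >= max_rowgroup_rows {
            // Close the current queue (by dropping its sender) and spawn
            // the next one with its own writer task, as in the diagram.
            let (next_tx, next_rx) = mpsc::channel::<Batch>(queue_capacity);
            tx = next_tx;
            writers.push(tokio::spawn(serialize_row_group(writers.len(), next_rx)));
            rows_in_current = 0;
        }
    }
    drop(tx);
    for w in writers {
        w.await.unwrap();
    }
}
```
The real implementation additionally caps how many queues run concurrently
(maximum_parallel_row_group_writers) and fans each queue out to per-column
serializer tasks; both are elided here for brevity.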