devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1356835532

##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
     /// Number of files to read in parallel when inferring schema and statistics
     pub meta_fetch_concurrency: usize, default = 32
+
+    /// Target number of rows in output files when writing multiple.
+    /// This is a soft max, so it can be exceeded slightly. There also
+    /// will be one file smaller than the limit if the total
+    /// number of rows written is not roughly divisible by the soft max
+    pub soft_max_rows_per_output_file: usize, default = 50000000
+
+    /// This is the maximum number of output files being written
+    /// in parallel. Higher values can potentially give faster write
+    /// performance at the cost of higher peak memory consumption.
+    pub max_parallel_ouput_files: usize, default = 8
+
+    /// This is the maximum number of RecordBatches buffered
+    /// for each output file being worked. Higher values can potentially
+    /// give faster write performance at the cost of higher peak
+    /// memory consumption
+    pub max_buffered_batches_per_output_file: usize, default = 5000

Review Comment:
   The reason I set this default so high is to allow for the possibility that a single file writer cannot keep up with the rate at which batches are generated. Once enough data has been buffered, the 2nd, 3rd, ... file writers kick in and work in parallel, and throughput eventually stabilizes to match the speed at which batches are produced. If the buffer is too small, only one file can be worked on in parallel.
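   For illustration only (not part of this PR), here is a minimal sketch of how a user might tune these knobs via `SessionConfig`, assuming the new options are exposed under the `datafusion.execution.*` key namespace like the neighboring options in this `config_namespace!` block. Option names are copied from the diff above, including the `ouput` spelling of `max_parallel_ouput_files`.

   ```rust
   use datafusion::prelude::{SessionConfig, SessionContext};

   fn main() {
       // Sketch: assumed option keys, mirroring the identifiers in the diff above.
       let config = SessionConfig::new()
           // A larger per-file buffer lets the 2nd, 3rd, ... writers kick in
           // when a single writer cannot keep up with incoming batches.
           .set_usize("datafusion.execution.max_buffered_batches_per_output_file", 5000)
           // Cap on how many output files are written concurrently.
           .set_usize("datafusion.execution.max_parallel_ouput_files", 8)
           // Soft cap on rows per output file.
           .set_usize("datafusion.execution.soft_max_rows_per_output_file", 50_000_000);

       let ctx = SessionContext::with_config(config);
       // ... register tables and run a write (e.g. INSERT INTO / COPY) with `ctx`
       let _ = ctx;
   }
   ```

   With a setup like this, the buffer size and the parallel-file cap work together: buffered batches accumulate until additional writers are started, up to the configured maximum.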
########## datafusion/common/src/config.rs: ########## @@ -254,6 +254,24 @@ config_namespace! { /// Number of files to read in parallel when inferring schema and statistics pub meta_fetch_concurrency: usize, default = 32 + + /// Target number of rows in output files when writing multiple. + /// This is a soft max, so it can be exceeded slightly. There also + /// will be one file smaller than the limit if the total + /// number of rows written is not roughly divisible by the soft max + pub soft_max_rows_per_output_file: usize, default = 50000000 + + /// This is the maximum number of output files being written + /// in parallel. Higher values can potentially give faster write + /// performance at the cost of higher peak memory consumption. + pub max_parallel_ouput_files: usize, default = 8 + + /// This is the maximum number of RecordBatches buffered + /// for each output file being worked. Higher values can potentially + /// give faster write performance at the cost of higher peak + /// memory consumption + pub max_buffered_batches_per_output_file: usize, default = 5000 Review Comment: The reason I set this so high is to allow for the possibility that 1 file writer cannot keep up with the batches being generated. Eventually enough data is buffered that the 2nd, 3rd, ... file writer will kick in and work in parallel. Eventually it will stabilize and keep up with the speed that batches are being generated. If the buffer is too small, then only 1 file can be worked in parallel. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org