devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1356835532

##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
     /// Number of files to read in parallel when inferring schema and statistics
     pub meta_fetch_concurrency: usize, default = 32
+
+    /// Target number of rows in output files when writing multiple.
+    /// This is a soft max, so it can be exceeded slightly. There also
+    /// will be one file smaller than the limit if the total
+    /// number of rows written is not roughly divisible by the soft max
+    pub soft_max_rows_per_output_file: usize, default = 50000000
+
+    /// This is the maximum number of output files being written
+    /// in parallel. Higher values can potentially give faster write
+    /// performance at the cost of higher peak memory consumption.
+    pub max_parallel_ouput_files: usize, default = 8
+
+    /// This is the maximum number of RecordBatches buffered
+    /// for each output file being worked. Higher values can potentially
+    /// give faster write performance at the cost of higher peak
+    /// memory consumption
+    pub max_buffered_batches_per_output_file: usize, default = 5000

Review Comment:
   The reason I set this default so high is to allow for the possibility that a single file writer cannot keep up with the rate at which batches are generated. Once enough data has been buffered, the 2nd, 3rd, ... file writers kick in and work in parallel, and throughput eventually stabilizes to match the speed at which batches are produced. If the buffer is too small, only one file can be worked on in parallel.
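   For illustration only (not part of this PR), here is a minimal sketch of how a user might tune these knobs via `SessionConfig`, assuming the new options are exposed under the `datafusion.execution.*` key namespace like the neighboring options in this `config_namespace!` block. Option names are copied from the diff above, including the `ouput` spelling of `max_parallel_ouput_files`.

   ```rust
   use datafusion::prelude::{SessionConfig, SessionContext};

   fn main() {
       // Sketch: assumed option keys, mirroring the identifiers in the diff above.
       let config = SessionConfig::new()
           // A larger per-file buffer lets the 2nd, 3rd, ... writers kick in
           // when a single writer cannot keep up with incoming batches.
           .set_usize("datafusion.execution.max_buffered_batches_per_output_file", 5000)
           // Cap on how many output files are written concurrently.
           .set_usize("datafusion.execution.max_parallel_ouput_files", 8)
           // Soft cap on rows per output file.
           .set_usize("datafusion.execution.soft_max_rows_per_output_file", 50_000_000);

       let ctx = SessionContext::with_config(config);
       // ... register tables and run a write (e.g. INSERT INTO / COPY) with `ctx`
       let _ = ctx;
   }
   ```

   With a setup like this, the buffer size and the parallel-file cap work together: buffered batches accumulate until additional writers are started, up to the configured maximum.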
########## datafusion/common/src/config.rs: ########## @@ -254,6 +254,24 @@ config_namespace! { /// Number of files to read in parallel when inferring schema and statistics pub meta_fetch_concurrency: usize, default = 32 + + /// Target number of rows in output files when writing multiple. + /// This is a soft max, so it can be exceeded slightly. There also + /// will be one file smaller than the limit if the total + /// number of rows written is not roughly divisible by the soft max + pub soft_max_rows_per_output_file: usize, default = 50000000 + + /// This is the maximum number of output files being written + /// in parallel. Higher values can potentially give faster write + /// performance at the cost of higher peak memory consumption. + pub max_parallel_ouput_files: usize, default = 8 + + /// This is the maximum number of RecordBatches buffered + /// for each output file being worked. Higher values can potentially + /// give faster write performance at the cost of higher peak + /// memory consumption + pub max_buffered_batches_per_output_file: usize, default = 5000 Review Comment: The reason I set this so high is to allow for the possibility that 1 file writer cannot keep up with the batches being generated. Eventually enough data is buffered that the 2nd, 3rd, ... file writer will kick in and work in parallel. Eventually it will stabilize and keep up with the speed that batches are being generated. If the buffer is too small, then only 1 file can be worked in parallel. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org