devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1356837488


##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
         /// Number of files to read in parallel when inferring schema and statistics
         pub meta_fetch_concurrency: usize, default = 32
+
+        /// Target number of rows in output files when writing multiple.
+        /// This is a soft max, so it can be exceeded slightly. There also
+        /// will be one file smaller than the limit if the total
+        /// number of rows written is not roughly divisible by the soft max
+        pub soft_max_rows_per_output_file: usize, default = 50000000
+
+        /// This is the maximum number of output files being written
+        /// in parallel. Higher values can potentially give faster write
+        /// performance at the cost of higher peak memory consumption.
+        pub max_parallel_ouput_files: usize, default = 8
+
+        /// This is the maximum number of RecordBatches buffered
+        /// for each output file being worked. Higher values can potentially
+        /// give faster write performance at the cost of higher peak
+        /// memory consumption
+        pub max_buffered_batches_per_output_file: usize, default = 5000

Review Comment:
   Having a "minimum_parallel_writers" setting as discussed in our main comments is a better solution for this. So, I'll reduce the buffer size as you suggest.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
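
For context on why the buffer size matters, here is a rough sketch of how the three settings in the diff interact. The `WriteSettings` struct and both helper functions are hypothetical and are not DataFusion's API; only the field names and default values come from the diff. With those defaults, the worst case is 8 × 5000 = 40000 RecordBatches held in memory at once, which is the bound a smaller buffer would reduce.

```rust
/// Hypothetical mirror of the three settings in the diff above.
/// This is an illustration only, not DataFusion's actual config type.
struct WriteSettings {
    soft_max_rows_per_output_file: usize,
    // Spelled as in the diff.
    max_parallel_ouput_files: usize,
    max_buffered_batches_per_output_file: usize,
}

/// Upper bound on RecordBatches buffered across all files written in parallel.
fn worst_case_buffered_batches(s: &WriteSettings) -> usize {
    s.max_parallel_ouput_files * s.max_buffered_batches_per_output_file
}

/// Approximate number of output files for `total_rows` (ceiling division).
/// The limit is described as soft, so the real count may differ slightly.
fn approx_output_files(total_rows: usize, s: &WriteSettings) -> usize {
    if total_rows == 0 {
        return 0;
    }
    (total_rows + s.soft_max_rows_per_output_file - 1) / s.soft_max_rows_per_output_file
}

fn main() {
    // Default values taken from the diff.
    let s = WriteSettings {
        soft_max_rows_per_output_file: 50_000_000,
        max_parallel_ouput_files: 8,
        max_buffered_batches_per_output_file: 5000,
    };
    println!("worst-case buffered batches: {}", worst_case_buffered_batches(&s));
    println!("files for 120M rows: {}", approx_output_files(120_000_000, &s));
}
```

The memory cost of each buffered batch depends on batch size and schema, so the sketch counts batches rather than bytes.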