devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1353775705

##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
         /// Number of files to read in parallel when inferring schema and statistics
         pub meta_fetch_concurrency: usize, default = 32
+
+        /// Target number of rows in output files when writing multiple.
+        /// This is a soft max, so it can be exceeded slightly. There also
+        /// will be one file smaller than the limit if the total
+        /// number of rows written is not roughly divisible by the soft max
+        pub soft_max_rows_per_output_file: usize, default = 50000000

Review Comment:
   The ideal value here is very situational, so we definitely need to make this configurable at the statement and table level.

##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -782,63 +706,93 @@ impl DataSink for ParquetSink {
             .runtime_env()
             .object_store(&self.config.object_store_url)?;
-        let mut row_count = 0;
+        let exec_options = &context.session_config().options().execution;
+
+        let allow_single_file_parallelism =
+            exec_options.parquet.allow_single_file_parallelism;
+
+        // This is a temporary special case until https://github.com/apache/arrow-datafusion/pull/7655

Review Comment:
   The current single-file parallelization strategy on main does not work well with this PR (it doesn't error, but it is very slow). The pending one, #7655, should work great though. Once the parquet crate has a new release, I can combine it with this PR.
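
On the `soft_max_rows_per_output_file` doc comment: the soft-max behaviour it describes can be illustrated with a minimal sketch. This is not the code in this PR. `split_rows_into_files` is a hypothetical helper, and the sketch assumes that a record batch is never split across files, which the diff shown here does not confirm.

```rust
/// Hypothetical helper (not part of DataFusion): given the row counts of the
/// incoming record batches, return how many rows land in each output file
/// when a file is closed as soon as it reaches `soft_max_rows`.
fn split_rows_into_files(batch_rows: &[usize], soft_max_rows: usize) -> Vec<usize> {
    let mut files = Vec::new();
    let mut current = 0usize;
    for &rows in batch_rows {
        // Assumption: a batch is written whole, so the running total can
        // overshoot the soft max by up to one batch minus one row.
        current += rows;
        if current >= soft_max_rows {
            files.push(current);
            current = 0;
        }
    }
    // Leftover rows form one final file that is smaller than the limit.
    if current > 0 {
        files.push(current);
    }
    files
}
```

For example, `split_rows_into_files(&[8192, 8192, 8192, 8192], 20_000)` returns `[24576, 8192]`: the first file exceeds the soft max because the third batch is written whole, and the remainder goes to a smaller final file.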