devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1353775705
##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
/// Number of files to read in parallel when inferring schema and statistics
pub meta_fetch_concurrency: usize, default = 32
+
+ /// Target number of rows in output files when writing multiple.
+ /// This is a soft max, so it can be exceeded slightly. There also
+ /// will be one file smaller than the limit if the total
+ /// number of rows written is not roughly divisible by the soft max
+ pub soft_max_rows_per_output_file: usize, default = 50000000
Review Comment:
The ideal value here is very situational, so we definitely need to make this
configurable at the statement and table level.
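
To illustrate the "soft max" behavior described in the doc comment, here is a minimal sketch. It is not the code in this PR. It assumes that batches are never split and that a new file is started only once the current one has reached the limit. The function name `rows_per_file` and its parameters are hypothetical.

```rust
/// Hypothetical helper, for illustration only: given the row count of each
/// incoming batch, return the row count of each output file.
///
/// Assumption: batches are never split, and a new file starts only after the
/// current file has reached `soft_max_rows`. A file can therefore exceed the
/// limit by less than one batch.
fn rows_per_file(batch_rows: &[usize], soft_max_rows: usize) -> Vec<usize> {
    let mut files = Vec::new();
    let mut current = 0usize;

    for &rows in batch_rows {
        current += rows;
        // The limit is checked after the whole batch is appended,
        // which is what makes it a soft rather than a hard maximum.
        if current >= soft_max_rows {
            files.push(current);
            current = 0;
        }
    }

    // Leftover rows form one final file smaller than the limit.
    if current > 0 {
        files.push(current);
    }

    files
}
```

Under these assumptions, ten batches of 8192 rows with a soft max of 30000 would give files of 32768, 32768, and 16384 rows: two files slightly over the limit and one smaller trailing file.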
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -782,63 +706,93 @@ impl DataSink for ParquetSink {
.runtime_env()
.object_store(&self.config.object_store_url)?;
- let mut row_count = 0;
+ let exec_options = &context.session_config().options().execution;
+
+ let allow_single_file_parallelism =
+ exec_options.parquet.allow_single_file_parallelism;
+
+ // This is a temporary special case until https://github.com/apache/arrow-datafusion/pull/7655
Review Comment:
The current single-file parallelization strategy on main does not work well
with this PR (it doesn't error, but it is very slow). The pending one in #7655
should work well, though. Once the parquet crate has a new release, I can
combine it with this PR.