Re: [PR] DataSink Dynamic Execution Time Demux [arrow-datafusion]

via GitHub Tue, 10 Oct 2023 18:34:26 -0700


devinjdangelo commented on code in PR #7791:
URL: https://github.com/apache/arrow-datafusion/pull/7791#discussion_r1353775705



##########
datafusion/common/src/config.rs:
##########
@@ -254,6 +254,24 @@ config_namespace! {
 
         /// Number of files to read in parallel when inferring schema and statistics
         pub meta_fetch_concurrency: usize, default = 32
+
+        /// Target number of rows in output files when writing multiple.
+        /// This is a soft max, so it can be exceeded slightly. There also
+        /// will be one file smaller than the limit if the total
+        /// number of rows written is not roughly divisible by the soft max
+        pub soft_max_rows_per_output_file: usize, default = 50000000

Review Comment:
   The ideal value here is very situational, so we definitely need to make this 
configurable at the statement and table level.
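
   To illustrate the "soft max" behavior described in the doc comment, here is 
a minimal sketch. It is not the demux code from this PR: the helper name 
`rows_per_file` and the rule of rolling over to a new file only after a whole 
batch has been written are assumptions made for illustration.

```rust
/// Hypothetical helper (not part of DataFusion): given the row count of each
/// incoming batch, return how many rows each output file would receive if a
/// file is closed once it reaches `soft_max_rows`.
/// Assumption: batches are never split, so a file can exceed the limit by up
/// to one batch minus one row.
fn rows_per_file(batch_rows: &[usize], soft_max_rows: usize) -> Vec<usize> {
    let mut files = Vec::new();
    let mut current = 0usize;
    for &rows in batch_rows {
        current += rows;
        // Roll over only after the whole batch is written (soft limit).
        if current >= soft_max_rows {
            files.push(current);
            current = 0;
        }
    }
    // Leftover rows form one final file smaller than the limit.
    if current > 0 {
        files.push(current);
    }
    files
}

fn main() {
    // Ten batches of 8192 rows with a soft max of 20000 rows per file.
    let files = rows_per_file(&[8192; 10], 20_000);
    println!("{:?}", files);
}
```

   With these inputs, each full file slightly exceeds the soft max and the 
last file holds the remainder, which matches the two caveats in the doc 
comment.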



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -782,63 +706,93 @@ impl DataSink for ParquetSink {
             .runtime_env()
             .object_store(&self.config.object_store_url)?;
 
-        let mut row_count = 0;
+        let exec_options = &context.session_config().options().execution;
+
+        let allow_single_file_parallelism =
+            exec_options.parquet.allow_single_file_parallelism;
+
+        // This is a temporary special case until https://github.com/apache/arrow-datafusion/pull/7655

Review Comment:
   The current single-file parallelization strategy on main does not work well 
with this PR (it doesn't error, but it is very slow). The pending one in #7655 
should work well, though. Once the parquet crate has a new release, I can 
combine it with this PR.


