devinjdangelo opened a new issue, #7767: URL: https://github.com/apache/arrow-datafusion/issues/7767
### Is your feature request related to a problem or challenge?

Currently, FileSinks (e.g. ParquetSink) output one file for each input partition. For example, the `FileSink::write_all` method accepts a `Vec<RecordBatchStream>` and independently serializes and writes each `RecordBatchStream` to an `ObjectStore`. This setup is easy to implement and parallelizes efficiently (it is trivial to spawn a task to process each `RecordBatchStream` in parallel), but there are a few drawbacks:

1. The number of output files is arbitrarily determined by how the plan is partitioned. Often you will see one output file for each vcore in your system, even if that means one file with data and 15 empty files. This is confusing and poor UX (#5383).
2. We provide an option `single_file_output` that forces a single output file, but there is no finer-grained control than that unless you explicitly repartition the plan (e.g. add a `RoundRobinPartition(4)` to get 4 output files).
3. It is also unclear in the current setup how `FileSink` can or should handle writes to hive-style partitioned tables (#7744), since we cannot know the correct number of output files up front and thus cannot construct a `Vec<RecordBatchStream>`.

### Describe the solution you'd like

I would like to provide users with options such as the following, which will determine the number of output files:

1. Maximum rows per file
2. Maximum file size in bytes

To respect these settings, the execution plan will need to create new file writers dynamically as execution proceeds, rather than all up front (see the rotation sketch below). Enabling this is a challenge similar to the one discussed in #7744. Ultimately, the input signature of FileSinks will need to change. Perhaps an upstream execution plan (`FileSinkRepartitionExec`) could be responsible for dividing a single incoming `RecordBatchStream` into a dynamic number of output streams (`Stream<Item = RecordBatchStream>`); a rough sketch of this splitting also follows below. The FileSink would then consume each inner stream as it arrives, spawning a new task to write each file. `FileSinkRepartitionExec` could also have specialized logic for handling writes to hive-style partitioned tables.

### Describe alternatives you've considered

FileSink could also be reworked to accept a single `RecordBatchStream` and handle the repartitioning logic within its own execution plan, rather than relying on a new upstream plan.

### Additional context

_No response_
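To make the "maximum rows per file" idea above concrete, here is a minimal, self-contained sketch of the writer-rotation loop a sink would need: a new file writer is opened whenever the current file would exceed the configured row limit. All of the names (`FileRotationConfig`, `FileWriter`, `open_new_writer`, `write_with_rotation`) are hypothetical stand-ins rather than DataFusion APIs, and the batch type is reduced to a row count for brevity.

```rust
// Hypothetical sketch: rotating output files based on a max-rows-per-file limit.
// None of these types are actual DataFusion APIs.

struct FileRotationConfig {
    max_rows_per_file: usize,
    // max_file_size_bytes could be tracked analogously by asking the
    // serializer how many bytes it has flushed so far.
}

/// Stand-in for arrow's RecordBatch; only the row count matters here.
struct Batch {
    num_rows: usize,
}

struct FileWriter {
    path: String,
    rows_written: usize,
}

impl FileWriter {
    fn write(&mut self, batch: &Batch) {
        // In the real sink this would serialize the batch (e.g. to Parquet)
        // and stream the bytes to an ObjectStore.
        self.rows_written += batch.num_rows;
    }
}

fn open_new_writer(file_index: usize) -> FileWriter {
    FileWriter {
        path: format!("part-{file_index}.parquet"),
        rows_written: 0,
    }
}

/// Consume an input partition, starting a new file whenever the current one
/// would exceed `max_rows_per_file`. Returns the paths that were written.
fn write_with_rotation(batches: Vec<Batch>, config: &FileRotationConfig) -> Vec<String> {
    let mut files = Vec::new();
    let mut file_index = 0;
    let mut writer = open_new_writer(file_index);

    for batch in batches {
        if writer.rows_written > 0
            && writer.rows_written + batch.num_rows > config.max_rows_per_file
        {
            files.push(writer.path.clone());
            file_index += 1;
            writer = open_new_writer(file_index);
        }
        writer.write(&batch);
    }
    files.push(writer.path);
    files
}

fn main() {
    let config = FileRotationConfig { max_rows_per_file: 1000 };
    let batches = (0..10).map(|_| Batch { num_rows: 300 }).collect();
    let files = write_with_rotation(batches, &config);
    println!("wrote {} files: {files:?}", files.len());
}
```

A byte-size limit would presumably work the same way, except the rotation check would ask the serializer how many bytes it has flushed so far, which for formats like Parquet is only an estimate until the current row group is closed.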

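To illustrate the `FileSinkRepartitionExec` idea (one incoming stream split into a dynamic number of per-file streams), here is a rough sketch using Tokio channels, assuming the repartitioner hands each inner stream to the sink as soon as it is created so the sink can spawn a write task for it. `split_by_max_rows` and the simplified `Batch` type are illustrative only; a real implementation would carry `RecordBatch`es and would also need the hive-partitioning logic from #7744.

```rust
// Hypothetical sketch of the "stream of streams" repartitioning idea: a single
// input stream of batches is split into a dynamically created sequence of
// per-file streams, each capped at `max_rows_per_file`.

use futures::stream::{self, StreamExt};
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

/// Stand-in for a RecordBatch; the value is the number of rows in the batch.
type Batch = usize;

/// Spawn a task that reads `input` and emits one inner stream per output file.
fn split_by_max_rows(
    input: Vec<Batch>,
    max_rows_per_file: usize,
) -> mpsc::Receiver<ReceiverStream<Batch>> {
    let (outer_tx, outer_rx) = mpsc::channel(4);

    tokio::spawn(async move {
        let mut input = stream::iter(input);
        // (sender for the current file, rows sent to it so far)
        let mut current: Option<(mpsc::Sender<Batch>, usize)> = None;

        while let Some(batch) = input.next().await {
            // Start a new per-file stream if there is none yet, or if the
            // current file would exceed the row limit.
            let need_new = match &current {
                Some((_, rows)) => *rows > 0 && *rows + batch > max_rows_per_file,
                None => true,
            };
            if need_new {
                let (tx, rx) = mpsc::channel(4);
                // Hand the new per-file stream to the consumer (the FileSink),
                // which can immediately spawn a task to write that file.
                if outer_tx.send(ReceiverStream::new(rx)).await.is_err() {
                    return;
                }
                current = Some((tx, 0));
            }
            if let Some((tx, rows)) = &mut current {
                let _ = tx.send(batch).await;
                *rows += batch;
            }
        }
        // Dropping `current` and `outer_tx` closes the remaining streams.
    });

    outer_rx
}

#[tokio::main]
async fn main() {
    // 10 batches of 300 rows, at most 1000 rows per file -> 4 output files.
    let mut files = split_by_max_rows(vec![300; 10], 1000);
    let mut file_index = 0;
    while let Some(mut file_stream) = files.recv().await {
        let mut rows = 0;
        while let Some(batch) = file_stream.next().await {
            rows += batch;
        }
        println!("file {file_index}: {rows} rows");
        file_index += 1;
    }
}
```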