devinjdangelo opened a new issue, #7767:
URL: https://github.com/apache/arrow-datafusion/issues/7767

   ### Is your feature request related to a problem or challenge?
   
   Currently, FileSinks (e.g. `ParquetSink`) output 1 file for each input 
partition: the `FileSink::write_all` method accepts a 
`Vec<RecordBatchStream>` and independently serializes and writes each 
`RecordBatchStream` to an `ObjectStore`. This setup is easy to implement and 
parallelizes efficiently (it is trivial to spawn a task to process each 
`RecordBatchStream` in parallel), but it has a few drawbacks:
   
   1. The number of output files is arbitrarily determined by how the plan is 
partitioned. Often, you will see 1 output file for each vcore in your 
system, even if that means there is 1 file with data and 15 empty files. This 
is confusing and poor UX (#5383).
   2. We provide a `single_file_output` option that forces a single output 
file, but there is no finer-grained control than that unless you 
explicitly repartition the plan (e.g. add a `RoundRobinPartition(4)` to get 4 
output files).
   3. It is also unclear in the current setup how `FileSink` can or should 
handle writes to Hive-style partitioned tables (#7744), since we cannot know the 
correct number of output files up front and thus cannot construct a 
`Vec<RecordBatchStream>`.
   
   ### Describe the solution you'd like
   
   I would like to provide users with options such as the following, which would 
determine the number of output files:
   
   1. Maximum rows per file
   2. Maximum file size bytes
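   A minimal sketch of how such limits might be checked while writing, assuming hypothetical option names (the real configuration keys and struct would be decided during implementation):

   ```rust
   // Hypothetical options struct; field names are placeholders for the
   // two limits proposed above.
   struct FileSinkOptions {
       max_rows_per_file: Option<usize>,
       max_file_size_bytes: Option<usize>,
   }

   // Returns true when the in-progress file should be closed and a new
   // writer created, per whichever limits are set. `None` means unlimited.
   fn should_roll(opts: &FileSinkOptions, rows_so_far: usize, bytes_so_far: usize) -> bool {
       opts.max_rows_per_file.map_or(false, |m| rows_so_far >= m)
           || opts.max_file_size_bytes.map_or(false, |m| bytes_so_far >= m)
   }
   ```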
   
   To respect these settings, the execution plan will need to create new file 
writers dynamically as execution proceeds, rather than all up front. 
Enabling this is a challenge similar to the one discussed in #7744. Ultimately, 
the input signature of FileSinks will need to change. Perhaps an upstream 
execution plan (`FileSinkRepartitionExec`) could be responsible for dividing a 
single incoming `RecordBatchStream` into a dynamic number of output streams, 
i.e. a `Stream<Item = RecordBatchStream>`. `FileSink` would then consume each 
stream as it arrives, spawning a new task to write each file.
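   A synchronous caricature of the splitting logic such an exec might perform, assuming a max-rows limit (row counts stand in for `RecordBatch`es; in the real async design, each inner group would be yielded as its own `RecordBatchStream` to a spawned writer task):

   ```rust
   // Divide one incoming sequence of batches into a dynamic number of
   // per-file groups, cutting a new group whenever `max_rows_per_file`
   // is reached. The number of output files is now data-driven rather
   // than fixed by the plan's partitioning.
   fn split_by_max_rows(batch_rows: &[usize], max_rows_per_file: usize) -> Vec<Vec<usize>> {
       let mut files: Vec<Vec<usize>> = Vec::new();
       let mut current: Vec<usize> = Vec::new();
       let mut rows = 0;
       for &n in batch_rows {
           current.push(n);
           rows += n;
           if rows >= max_rows_per_file {
               // close the current file's group and start a new one
               files.push(std::mem::take(&mut current));
               rows = 0;
           }
       }
       if !current.is_empty() {
           files.push(current);
       }
       files
   }
   ```

   Note this sketch only splits on batch boundaries; the real implementation could additionally slice a batch to respect the limit exactly.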
   
   `FileSinkRepartitionExec` could also have specialized logic for handling 
writes to Hive-style partitioned tables.
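   To illustrate the Hive-partitioning case, here is a hypothetical sketch of routing rows to a dynamic set of output groups keyed by partition value (a string key per row stands in for the partition column; each group would become its own output stream/file path):

   ```rust
   use std::collections::BTreeMap;

   // Route rows to output groups by partition key. The set of keys, and
   // hence the number of output files, is only known as data arrives,
   // which is exactly why writers must be created dynamically.
   fn partition_by_key(rows: &[(&str, i64)]) -> BTreeMap<String, Vec<i64>> {
       let mut groups: BTreeMap<String, Vec<i64>> = BTreeMap::new();
       for &(key, val) in rows {
           groups.entry(key.to_string()).or_default().push(val);
       }
       groups
   }
   ```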
   
   ### Describe alternatives you've considered
   
   FileSink could also be reworked to accept a single `RecordBatchStream` and 
handle the repartitioning logic within its own execution plan, rather than in 
a new upstream plan.
   
   ### Additional context
   
   _No response_

