Re: [I] Dynamically Determine the Number of Output Files based on Configs [arrow-datafusion]

via GitHub Sun, 08 Oct 2023 10:57:33 -0700


alamb commented on issue #7767:
URL: 
https://github.com/apache/arrow-datafusion/issues/7767#issuecomment-1752117633


   > I would like to provide users with options such as the following which 
will determine the number of output files:
   >
   > Maximum rows per file
   > Maximum
   
   I agree this makes a lot of sense
   
   
   > FileSinkRepartitionExec could also have specialized logic for handling 
writes to hive style partitioned tables.
   
   This is the option that makes the most sense to me. Maybe we could combine some 
of the same logic to avoid writing files unless they actually have data.
   
   
   > FileSink could also be reworked to accept a single RecordBatchStream and 
handle repartitioning logic within its own execution plan, rather than creating 
a new upstream plan.
   
   I remember @tustvold, @metesynnada, @ozankabak, and I discussed the 
various tradeoffs between where the write partitioning would be determined (in 
the plan or in the writer), and I believe the conclusion was "it depends" 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
