Jefffrey commented on issue #5383:
URL: 
https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445473446

   > I agree that I don't know what is better. I don't really use the DataFrame 
API and so I don't know if the "write multiple files" is an important feature 
or if it was just the most straightforward initial implementation
   
   I feel it is an important feature, since having control over whether or not 
to repartition/coalesce before writes makes sense from a performance 
perspective. Apache Spark seems to behave similarly, writing out empty 
partitions, when I run a similar test.
   
   > I am not sure about this, although perhaps simple changing the code to 
ensure there is a header row written (even if there was no data) would be less 
confusing overall?
   
   Yeah, I feel this header row part is technically a separate bug (though 
related, since if we didn't want to write empty partitions then we wouldn't 
have to fix it).
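
   To make the header row point concrete, here is a minimal sketch in plain 
Python (standard library only, not DataFusion's actual writer). The function 
name `write_partitions` and the `part-N.csv` file naming are hypothetical, 
chosen only for illustration. It writes one file per partition and always 
emits the header, so an empty partition still produces a well-formed CSV:

```python
import csv
from pathlib import Path


def write_partitions(partitions, header, out_dir):
    """Write one CSV file per partition (hypothetical helper, not DataFusion API).

    The header row is written unconditionally, so a partition with no rows
    yields a file containing only the header instead of an empty file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, rows in enumerate(partitions):
        path = out_dir / f"part-{i}.csv"  # assumed naming scheme
        with path.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # written even when `rows` is empty
            writer.writerows(rows)
        paths.append(path)
    return paths
```

   With this approach a reader can treat every output file the same way, 
whether or not the partition held any data.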
   
   > Is it required for the right usage of the library, for the developer to be 
exposed to the facts that data could exist in "parts"?
   
   Speaking from a Spark background, I feel this is an important concept for 
maximizing parallelism via partitioning data. Having the default behaviour be 
to produce a single CSV on writes might make sense from a user-friendliness 
standpoint, especially for smaller datasets, but could have performance 
implications for larger ones, since it requires a coalesce to a single 
partition.
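
   As a rough illustration of why a single output file implies a coalesce, 
here is a plain-Python sketch (standard library only; `write_single_csv` is a 
hypothetical name, and this is not how DataFusion or Spark implement it). All 
partitions are funnelled through one writer, so the write itself becomes 
serial no matter how many partitions were computed in parallel upstream:

```python
import csv
from itertools import chain
from pathlib import Path


def write_single_csv(partitions, header, path):
    """Coalesce all partitions into one CSV file (hypothetical helper).

    Rows from every partition pass through a single writer in sequence,
    which is the serial step that a per-partition write would avoid.
    """
    path = Path(path)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        # chain.from_iterable concatenates the partitions in order
        writer.writerows(chain.from_iterable(partitions))
    return path
```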
   
   Though again, this is from a Spark perspective; I'm not sure how different 
DataFusion is regarding the performance of repartition/coalesce (especially 
since it's not a distributed engine like Spark).

