Jefffrey commented on issue #5383: URL: https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445473446
> I agree that I don't know what is better. I don't really use the DataFrame API and so I don't know if the "write multiple files" is an important feature or if it was just the most straightforward initial implementation

I feel it is an important feature, as having control over whether or not to repartition/coalesce before writes makes sense from a performance perspective. Apache Spark seems to have similar behaviour: it also writes out empty partitions when I run a similar test.

> I am not sure about this, although perhaps simple changing the code to ensure there is a header row written (even if there was no data) would be less confusing overall?

Yeah, I feel this header row part is technically a separate bug (though related, since if we didn't want to write empty partitions then we wouldn't have to fix it).

> Is it required for the right usage of the library, for the developer to be exposed to the facts that data could exist in "parts"?

Speaking from a Spark background, I feel this is an important concept for maximizing parallelization via partitioning data. Having the default behaviour be to produce a single CSV on writes might make sense from a user-friendliness perspective, especially for smaller datasets, but could have performance implications for larger ones, since it requires a coalesce to a single partition. Though again, this is from a Spark perspective; I'm not sure how different DataFusion is regarding the performance of repartition/coalesce (especially since it's not a distributed engine like Spark).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
