[GitHub] [arrow-datafusion] Jefffrey commented on issue #5383: The output of write_csv and write_json methods is confusing.

via GitHub Sun, 26 Feb 2023 02:49:04 -0800


Jefffrey commented on issue #5383:
URL: 
https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445326546


   > I think it would be best to defer creating the files until there is 
actually some data (aka don't create the writer until we have at least a single 
record batch to write)
   
   @alamb Another thought on this: if this is done, we could have cases where 
the parts written out aren't in sequential/increasing order, which could cause 
confusion as well. E.g. if parts 2 and 4 are the only ones with data, then only 
those will appear on the filesystem, like:
   
   ```
   [4.0K]  csv
   ├── [  11]  part-2.csv
   └── [  11]  part-4.csv
   ```
   
   I'm not sure which is more desirable: having 'gaps' in the parts written, 
or having empty parts. Or we could somehow write only the parts with data 
first (though that would break the parallel behaviour of the writes, unless we 
force a repartition).
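
To make the 'gaps' scenario concrete, here is a minimal Python sketch of deferred file creation. It is not DataFusion code: the function name `write_parts_lazily`, the list-of-batches input shape, and the `part-<i>.csv` naming are assumptions chosen to mirror the listing above.

```python
import csv
import os
import tempfile


def write_parts_lazily(partitions, out_dir, header):
    """Write each partition to part-<i>.csv, creating the file only once a
    non-empty batch arrives. Returns the sorted names of the files created.

    `partitions` is a list of partitions; each partition is a list of
    batches; each batch is a list of rows (hypothetical shape).
    """
    os.makedirs(out_dir, exist_ok=True)
    for i, batches in enumerate(partitions):
        out = None
        writer = None
        try:
            for batch in batches:
                if not batch:
                    continue  # an empty batch must not trigger file creation
                if writer is None:
                    path = os.path.join(out_dir, f"part-{i}.csv")
                    out = open(path, "w", newline="")
                    writer = csv.writer(out)
                    writer.writerow(header)
                writer.writerows(batch)
        finally:
            if out is not None:
                out.close()
    return sorted(os.listdir(out_dir))


# Only partitions 2 and 4 carry rows, so only those files appear.
demo_parts = [[], [[]], [[("a", 1)]], [], [[("b", 2)], []]]
print(write_parts_lazily(demo_parts, tempfile.mkdtemp(), ["name", "value"]))
# prints ['part-2.csv', 'part-4.csv']
```

The part index comes from the partition's position, not from a running count of files written, which is exactly what produces the gaps.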
   
   > The other thing we can do would be to add some way to the dataframe 
/ write_csv API to say "I want the results in a single partition/file" -- 
perhaps by adding `DataFrame::repartition` or something so the user can 
control if they want multiple files (potentially faster to write) or a single 
file (slower to write, but easier to use)
   
   This does sound like a good option to have for user flexibility, though it 
still leaves the question of what the default behaviour should be. Or maybe 
it's best to leave this for the user to decide, and to document the method so 
that it hints towards this option? It seems it would just be a simple wrapper 
over the `repartition(...)` method.
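
The trade-off between one file and many can be sketched as follows. This is a plain-Python illustration, not the DataFusion API: `write_csv_sketch` and its `single_file` flag are hypothetical names, and collapsing the partitions into one list stands in for a repartition to a single partition.

```python
import csv
import os
import tempfile


def write_csv_sketch(partitions, out_dir, header, single_file=False):
    """Write one CSV file per partition, or a single file if requested.

    `partitions` is a list of partitions, each a list of rows
    (hypothetical shape). Returns the sorted names of the files written.
    """
    if single_file:
        # Stand-in for repartitioning to one partition: all rows end up
        # in one part, so the write is serial but yields a single file.
        partitions = [[row for part in partitions for row in part]]
    os.makedirs(out_dir, exist_ok=True)
    for i, rows in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
    return sorted(os.listdir(out_dir))


demo_parts = [[("a", 1)], [], [("b", 2), ("c", 3)]]
print(write_csv_sketch(demo_parts, tempfile.mkdtemp(), ["name", "value"],
                       single_file=True))
# prints ['part-0.csv']
```

With `single_file=False` the same input produces three files, one of which holds only a header.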
   
   P.S. Another thing I just noticed is that the empty partition files 
shouldn't actually be completely empty. For CSV, the default is to write a 
header row in each file, so those empty parts should at least contain the 
header row.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

