wjones127 commented on issue #5383:
URL: 
https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445536499

   When it creates multiple files, is it creating one per "partition"? Or one 
per worker?
   
   I can understand the performance justification for one file per thread / 
worker. But I'm not sure about one file per partition, especially for a single-node 
engine, where the location of a partition isn't as important.
   
   Also, as @dadepo suggested, I don't think (in-memory) partitions are that 
useful to end users. I previously used Spark a lot, and my two biggest 
complaints about it are:
   
    1. It tightly coupled file layout with partitioning. If there were 5 large 
files, you wouldn't get more than 5 partitions unless you explicitly asked, 
even if the file format was something like Parquet that had a natural way to 
split into more bite-sized chunks (such as row groups). Similar issue if you 
had a bunch of tiny files; you would get way too many partitions / tasks.
    2. It asked me to tune the number of partitions, when what I would rather 
tune is the batch size. This is particularly true when I didn't know the size 
of the incoming data; `.repartition(2000)` made sense when 20GB of data was 
incoming, but no sense when it was only 100MB.
   
   From an end user's perspective, I'd much rather configure the number of 
workers and the batch size, not the number of partitions.
   
   IMO for now, it would be nice if DataFusion simply: 
   
   1. Didn't write empty files.
   2. Provided an example of writing the result of an execution plan to 
a single CSV file using the arrow-csv writer.
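   
   For (2), a rough sketch of what that example could look like: collect the 
plan's output and funnel every batch through a single `arrow::csv::Writer` 
(file names here are placeholders):
   
    ```rust
    use std::fs::File;
    
    use datafusion::arrow::csv::Writer;
    use datafusion::prelude::*;
    
    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;
    
        // Collect the results from all partitions into memory...
        let batches = df.collect().await?;
    
        // ...then write them all to one file.
        let file = File::create("output.csv")?;
        let mut writer = Writer::new(file);
        for batch in &batches {
            writer.write(batch)?;
        }
        Ok(())
    }
    ```
   
   For large results you'd want to use `df.execute_stream()` and write 
batch-by-batch rather than collecting everything first, but the shape is the 
same.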
   
   In the long run, I'd prefer something similar to the Arrow C++ / [PyArrow 
dataset 
writer](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html).
 It handles the partitioning of the dataset, and provides knobs for controlling 
the number of files that are more meaningful to users, such as the max file 
size and the max number of open files.
   
   Eventually, it would be nice to support something like:
   
   ```rust
   write_csv(df, Partitioning::SingleFile, WriterOptions::default()).await?;
    write_parquet(
        df,
        Partitioning::Hive(vec!["year", "month", "day"]),
        WriterOptions::builder().max_rows_per_file(2_000_000).build(),
    ).await?;
   ```
   
