wjones127 commented on issue #5383:
URL: 
https://github.com/apache/arrow-datafusion/issues/5383#issuecomment-1445536499

   When it creates multiple files, is it creating one per "partition"? Or one 
per worker?
   
   I can understand the performance justification for one file per thread / 
worker. But I'm not sure about one file per partition, especially for a single-node 
engine, where the location of a partition isn't as important.
   
   Also, as @dadepo suggested, I don't think (in-memory) partitions are that 
useful to end users. I previously used Spark a lot, and my two biggest 
complaints about it are:
   
    1. It tightly coupled file layout with partitioning. If there were 5 large 
files, you wouldn't get more than 5 partitions unless you explicitly asked, 
even if the file format was something like Parquet that had a natural way to 
split into more bite-sized chunks (such as row groups). Similar issue if you 
had a bunch of tiny files; you would get way too many partitions / tasks.
    2. It asked me to tune the number of partitions, when what I would rather 
tune is the batch size. This is particularly true when I didn't know the size 
of the incoming data; `.repartition(2000)` made sense when 20GB of data was 
incoming, but no sense when it was only 100MB.
   
   From an end user's perspective, I'd much rather configure the number of 
workers and the batch size, not the number of partitions.
   
   IMO for now, it would be nice if DataFusion simply: 
   
   1. Didn't write empty files.
   2. Provided an example of writing the result of an execution plan to 
a single CSV file using the arrow-csv writer.
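   
   For (2), a rough sketch of what that example could look like: collect the 
plan's output and funnel every batch through a single `arrow::csv::Writer` 
(file names here are placeholders):
   
    ```rust
    use std::fs::File;
    
    use datafusion::arrow::csv::Writer;
    use datafusion::prelude::*;
    
    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        let df = ctx.read_csv("input.csv", CsvReadOptions::new()).await?;
    
        // Collect the results from all partitions into memory...
        let batches = df.collect().await?;
    
        // ...then write them all to one file.
        let file = File::create("output.csv")?;
        let mut writer = Writer::new(file);
        for batch in &batches {
            writer.write(batch)?;
        }
        Ok(())
    }
    ```
   
   For large results you'd want to use `df.execute_stream()` and write 
batch-by-batch rather than collecting everything first, but the shape is the 
same.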
   
   In the long run, I'd prefer something similar to the Arrow C++ / [PyArrow 
dataset 
writer](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html).
 It handles the partitioning of the dataset, and provides knobs for controlling 
the number of files that are more meaningful to users, such as the max file 
size and the max number of open files.
   
   Eventually, it would be nice to support something like:
   
   ```rust
   write_csv(df, Partitioning::SingleFile, WriterOptions::default()).await?;
    write_parquet(
        df,
        Partitioning::Hive(vec!["year", "month", "day"]),
        WriterOptions::builder().max_rows_per_file(2_000_000).build(),
    ).await?;
   ```
   
