r2evans commented on issue #40723:
URL: https://github.com/apache/arrow/issues/40723#issuecomment-2026351876

   I understand preferring simplicity, but ... I cannot think of a situation where I would use `write_dataset` in the middle of a data pipe and expect the data to keep flowing. The whole premise of `write_dataset` is its _side effect_, not its return value. (I don't understand it for `readr::write_csv`, either.) The strength of returning the filenames that were created is that the underlying code (somewhere) already knows what they are, so there should be no need to scan for them. The scanning problem is worse when we write with partitioning into a path that already contains partitions. Whether we delete existing files, use a different `basename_template=`, or whatever, when partitions pre-exist there is no "_inexpensive_" way to determine whether my most recent call to `write_dataset` is what created the files I see. (I characterize "scanning all files" as relatively expensive. I do much of my work on an HPC whose filesystem has, at times, had a significant 5-10 second _lag_; that is anecdotal, but it is why I suggest that forcing a scan of the filesystem should not be necessary.)
   
   Think of this:
   
   ```r
   td <- tempfile()  # destination directory for the dataset
   arrow::write_dataset(mtcars[1:3, ], td, partitioning = "cyl")
   list.files(td, recursive = TRUE)
   # [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet"
   ### pause ...
   arrow::write_dataset(mtcars[4:5, ], td, partitioning = "cyl")
   list.files(td, recursive = TRUE)
   # [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"
   ```
   
   Which of those files were created by the second write? If you know the data, then you know that `cyl=4` is not present in rows 4-5, so we can dissect the data and figure it out; but with millions of rows that becomes rather more onerous.
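   
   For this toy example the dissection is easy; a sketch of that approach (assuming hive-style partitioning on `cyl`, as above):
   
   ```r
   # infer which partition directories the second write could have touched,
   # using only the data that was just written
   touched <- unique(mtcars[4:5, "cyl"])
   # rows 4-5 of mtcars have cyl 6 and 8, so only cyl=6/ and cyl=8/ were touched
   ```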
   
   Yes, we _can_ do a scan of the directory structure, and we'll see which 
files were created in the second batch by comparing the `ctime` values:
   
   ```r
   file.info(list.files(td, recursive = TRUE, full.names = TRUE), extra_cols = FALSE)
   #                                                                 size isdir mode               mtime               ctime               atime
   # /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=4/part-0.parquet 4579 FALSE  664 2024-03-28 19:57:03 2024-03-28 19:57:03 2024-03-28 19:57:03
   # /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=6/part-0.parquet 4579 FALSE  664 2024-03-28 19:58:11 2024-03-28 19:58:11 2024-03-28 19:57:03
   # /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=8/part-0.parquet 4579 FALSE  664 2024-03-28 19:58:11 2024-03-28 19:58:11 2024-03-28 19:58:11
   ```
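   
   That comparison can be scripted, of course; a sketch (assuming we capture a reference time immediately before the second write, and that the local and filesystem clocks agree, which is exactly what the laggy-HPC anecdote above calls into question):
   
   ```r
   # record a reference time just before the second write, then keep only
   # files whose mtime is at or after it
   stamp <- Sys.time()
   arrow::write_dataset(mtcars[4:5, ], td, partitioning = "cyl")
   files <- list.files(td, recursive = TRUE, full.names = TRUE)
   newfiles <- files[file.info(files, extra_cols = FALSE)$mtime >= stamp]
   ```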
   
   But do we really need to do that? The underlying code _knows_ (somewhere) 
that it only wrote to the `cyl=6` and `cyl=8` partitions.
   
   Conversely, if it returned the filenames and we still wanted the "tidyverse principle" of passing the data through, we can get that _trivially_ with
   
   ```r
   some %>%
     pipe() %>%
     { write_dataset(., path=td); . } %>%
     something_more()
   ```
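   
   (For what it's worth, magrittr's tee operator `%T>%` expresses the same pass-through without the braces; a sketch using the same placeholder names as above:)
   
   ```r
   library(magrittr)  # provides the tee operator %T>%
   library(arrow)
   some %>%
     pipe() %T>%                   # %T>% calls write_dataset() for its side
     write_dataset(path = td) %>%  # effect and forwards the piped data unchanged
     something_more()
   ```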
   
   Granted, if you are amenable to adding an option or two to facilitate either or both of those pathways, it could be as simple as one of the following:
   
   ```r
   some %>%
     pipe() %>%
     write_dataset(path=td) %>%
     something_more()
   newfiles <- write_dataset(some, path=td, return_files=TRUE)
   ```
   
   or
   
   ```r
   some %>%
     pipe() %>%
     write_dataset(path=td, return_data=TRUE) %>%
     something_more()
   ```
   
   or some variant of either/both of those.

