r2evans commented on issue #40723:
URL: https://github.com/apache/arrow/issues/40723#issuecomment-2026351876
I understand preferring simplicity, but ... I cannot think of a situation
where I would use `write_dataset` in the middle of a data pipe and expect the
data to continue. The premise of `write_dataset` is on its _side-effect_, not
on its return value. (I don't understand it for `readr::write_csv`, either.) I
suggest that the strength in returning the filenames-created is that the
underlying code (somewhere) already knows what the filenames are, so there
should be no need to scan for them. The cost of scanning is exacerbated when we
partition into an existing path with existing partitions. Whether we delete
existing files, use a different `basename_template=`, or whatever, when
partitions pre-exist, there are no "_inexpensive_" ways to determine if my
recent call to `write_dataset` is what created the files I see. (I characterize
"scanning all files" as relatively expensive. I do much work on an HPC using a
filesystem that has, at times, had a significant 5-10 second _lag_, but that
is just anecdotal for why I suggest that forcing a scan of the filesystem
should not be necessary.)
Think of this:
```r
arrow::write_dataset(mtcars[1:3,], td, partition = "cyl")
list.files(td, recursive = TRUE)
# [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet"
### pause ...
arrow::write_dataset(mtcars[4:5,], td, partition = "cyl")
list.files(td, recursive = TRUE)
# [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"
```
Which of those files were created on the second write? If you know the data,
then you'll know that `cyl=4` is not present in rows 4-5, so if we dissect the
data we can figure that out, but consider millions of rows and this becomes a
bit more onerous.
Yes, we _can_ do a scan of the directory structure, and we'll see which
files were created in the second batch by comparing the `ctime` values:
```r
file.info(list.files(td, recursive = TRUE, full.names = TRUE), extra_cols = FALSE)
#                                                                 size isdir mode               mtime               ctime               atime
# /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=4/part-0.parquet 4579 FALSE  664 2024-03-28 19:57:03 2024-03-28 19:57:03 2024-03-28 19:57:03
# /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=6/part-0.parquet 4579 FALSE  664 2024-03-28 19:58:11 2024-03-28 19:58:11 2024-03-28 19:57:03
# /home/r2/tmp/Rtmp1ZaQHP/file158b58c3a3b4.d/cyl=8/part-0.parquet 4579 FALSE  664 2024-03-28 19:58:11 2024-03-28 19:58:11 2024-03-28 19:58:11
```
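For reference, here is a minimal sketch of what that scan-and-compare workaround looks like when wrapped up. It uses base R only; `snapshot_files()` and `changed_files()` are hypothetical helper names, not part of arrow, and it assumes the filesystem's `mtime` resolution is fine enough to distinguish the two writes.

```r
# Hypothetical helper: record the modification time of every file under `dir`.
snapshot_files <- function(dir) {
  files <- list.files(dir, recursive = TRUE)
  data.frame(
    file = files,
    mtime = file.mtime(file.path(dir, files)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: files that are new, or whose mtime moved forward,
# between two snapshots taken with snapshot_files().
changed_files <- function(before, after) {
  old <- before$mtime[match(after$file, before$file)]
  after$file[is.na(old) | after$mtime > old]
}

# Intended use around a write (not run here, needs arrow):
#   before <- snapshot_files(td)
#   arrow::write_dataset(mtcars[4:5, ], td, partition = "cyl")
#   changed_files(before, snapshot_files(td))
```

Note that this needs two full recursive listings per write, which is exactly the cost I would like to avoid.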
But do we really need to do that? The underlying code _knows_ (somewhere)
that it only wrote to the `cyl=6` and `cyl=8` partitions.
Conversely, if it returns filenames and we want it to follow the "tidyverse
principle" of returning data, we can _trivially_ get that with
```r
some %>%
  pipe() %>%
  { write_dataset(., path=td); . } %>%  # write as a side-effect, then pass the data on
  something_more()
```
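If the braces feel awkward, the same pass-through can be had from a tiny wrapper. This is only a sketch: `tee()` is a hypothetical helper name, and `utils::write.csv` stands in for `write_dataset` so the example runs without arrow.

```r
# Hypothetical helper: call a side-effect function `f` on `data`,
# then hand `data` back unchanged so a pipeline can continue.
tee <- function(data, f, ...) {
  f(data, ...)
  invisible(data)
}

# Stand-in writer for illustration; arrow::write_dataset would slot in the same way:
#   some %>% pipe() %>% tee(arrow::write_dataset, path = td) %>% something_more()
out_file <- tempfile(fileext = ".csv")
result <- tee(head(mtcars, 3), utils::write.csv, file = out_file)
```

magrittr's tee pipe, `%T>%`, gives the same behavior without a helper.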
Granted, if you are amenable to adding an option or two that facilitates
either or both of those pathways, then it could be as simple as one of the
following:
```r
some %>%
  pipe() %>%
  write_dataset(path=td) %>%
  something_more()

# `return_files=` is a proposed argument, not one that exists today
newfiles <- some %>%
  pipe() %>%
  write_dataset(path=td, return_files=TRUE)
```
or
```r
some %>%
pipe() %>%
write_dataset(path=td, return_data=TRUE) %>%
something_more()
```
or some variant of either/both of those.