Re: [I] write_dataset returns nothing [arrow]

via GitHub Sat, 23 Mar 2024 09:59:47 -0700


r2evans commented on issue #40723:
URL: https://github.com/apache/arrow/issues/40723#issuecomment-2016547321


   I agree that returning something useful is a good thing. Mid-pipeline 
functions often pass the data through (usually invisibly), but I also don't 
see this as a mid-pipeline kind of operation.
   
   My thought is to return the filenames created. While we are calling this 
function primarily for its side-effect, what we don't know is exactly what 
filenames will be created; the only time we "know" is when we're writing 
completely new data (partitioning subdirs do not exist or are known to be 
empty). If that is ever important (I have an internal use-case where it is), 
then we have to _infer_ what files were created, something that can be 
ambiguous in situations where there are pre-existing files.
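   
   To illustrate the inference problem, here is a minimal sketch of the 
before/after directory diff one might use today. The helper name 
`new_files_after` and its `writer` argument are hypothetical and not part of 
arrow; only base R is used.
   
   ```r
   # Hypothetical helper, not part of arrow: list the files that appear
   # under `path` as a result of calling `writer()`.
   new_files_after <- function(path, writer) {
     before <- list.files(path, recursive = TRUE)
     writer()
     after <- list.files(path, recursive = TRUE)
     # Files that were overwritten in place exist in both listings,
     # so they are NOT reported here.
     setdiff(after, before)
   }

   # Usage sketch, assuming the arrow package is installed:
   # new_files_after("out_dir", function() {
   #   arrow::write_dataset(mtcars, "out_dir", partitioning = "cyl")
   # })
   ```
   
   A diff like this only catches files whose names are new. Anything written 
over a pre-existing file is missed, which is the ambiguity described above.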
   
   Do you think returning a `character` vector with the final filenames is both 
meaningful and feasible?
   
   Another argument for this: _if_ this returns files and it is being used 
mid-pipe where data would be more useful, there is an easy workaround:
   
   ```r
   # dplyr pipe: write for the side effect, then pass the data along
   dat %>%
     mutate(...) %>%
     { write_dataset(., ...); . } %>%
     summarize(...)
   # native pipe: same idea with an anonymous function
   dat |>
     transform(...) |>
     (\(.x) { write_dataset(.x, ...); .x })() |>
     aggregate(..., data = _)
   ```
   
   But if the function instead returns data, there is no unambiguous way to 
immediately know the filenames created.
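   
   If the pass-through pattern is needed often, the workaround can be wrapped 
in a small helper. This is only a sketch; `pass_through` is a hypothetical 
name, not an arrow or dplyr function.
   
   ```r
   # Hypothetical helper: call `f` on the data for its side effect,
   # discard whatever `f` returns, and hand the data to the next step.
   pass_through <- function(.data, f, ...) {
     f(.data, ...)
     invisible(.data)
   }

   # Usage sketch, assuming arrow is installed and "out_dir" is writable:
   # mtcars |>
   #   transform(kpl = mpg * 0.425) |>
   #   pass_through(arrow::write_dataset, "out_dir") |>
   #   nrow()
   ```
   
   With magrittr, the tee operator `%T>%` serves the same purpose without a 
helper.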


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
