amoeba commented on issue #40723:
URL: https://github.com/apache/arrow/issues/40723#issuecomment-2024449558

   I wonder if we could hook things up so `write_dataset` returns a Dataset 
object for the _serialized_ dataset instead of the input dataset. I too am not 
sure about the use case of putting `write_dataset` mid-pipeline.  
   
   Beyond the above use case, returning the serialized dataset would be a nice 
time-saver when you want to call `open_dataset` on the newly written dataset 
but dataset discovery might take some time (particularly on cloud storage). I 
think that by the time `write_dataset` returns, it may already know enough 
(schema, fragments, partitioning) to create a Dataset object and skip discovery.
   
   Getting the files out of `write_dataset` could then be done like this:
   
   ```r
   ds %>%
     write_dataset("outdir") %>%
     .$files
   ```
   
   and we wouldn't be limiting the information you could get out of 
`write_dataset` to just the files.
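   
   For comparison, here is a minimal sketch of the two-step approach this 
would replace, assuming the current behavior where `write_dataset` returns its 
input invisibly. The `mtcars` data, the `cyl` partitioning, and the temporary 
`outdir` path are placeholders chosen for illustration:
   
   ```r
   library(arrow)

   # Hypothetical output directory, for illustration only
   outdir <- file.path(tempdir(), "outdir")

   # write_dataset() hands back its input invisibly, not the written dataset
   write_dataset(mtcars, outdir, format = "parquet", partitioning = "cyl")

   # Reopening triggers dataset discovery, which can be slow on cloud storage
   written <- open_dataset(outdir)
   written$files
   ```
   
   The second step is the one that could be skipped if `write_dataset` 
returned the serialized Dataset directly.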

