Re: [I] [R] cannot collect multiple times after passing to_arrow() and collect() [arrow]

via GitHub Thu, 12 Sep 2024 10:58:58 -0700


amoeba commented on issue #44069:
URL: https://github.com/apache/arrow/issues/44069#issuecomment-2346917084


   Hello @abduazizR, thanks for filing an issue. Some implementation details 
are leaking here, and I can see how this is confusing. Your `to_arrow()` 
call creates what we call a RecordBatchReader, which can only be consumed 
once, and your call to `collect()` consumes it.
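   
   To see the one-shot behavior in isolation, here is a minimal sketch that 
leaves DuckDB out entirely. It assumes an arrow version that provides 
`as_record_batch_reader()`, and the object names (`tbl`, `reader`, `first`, 
`second`, `fresh`) are made up for the example.
   
   ```r
   library(arrow)

   # A small in-memory Table; it can be read as many times as needed
   tbl <- arrow_table(x = 1:3)

   # Wrap the Table in a RecordBatchReader, i.e. a one-shot stream of batches
   reader <- as_record_batch_reader(tbl)

   first <- reader$read_table()  # drains the stream and returns all rows
   second <- reader$read_table() # the stream is already drained at this point

   # Creating a new reader from the Table gives a fresh stream to consume
   fresh <- as_record_batch_reader(tbl)$read_table()
   ```
   
   The reader keeps no copy of what it has already handed out, so only the 
first read sees the data. What a second read does may depend on where the 
reader came from.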
   
   You have two workarounds here:
   
   ```r
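   library(arrow)
   library(dplyr) # for collect(); to_duckdb() also requires the duckdb package to be installed
   # 1. Convert to an arrow Table first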
   x <- iris |>
     to_duckdb() |>
     to_arrow() |>
     as_arrow_table()
   x |> collect() # can be called repeatedly
   
   # 2. or call to_arrow() every time you need to collect
   x <- iris |> to_duckdb()
   x |>
     to_arrow() |>
     collect() # can be called repeatedly
   ```
   
   Converting to a Table (option 1) comes with the downside that it 
materializes the entire Table in memory, but this might work fine for your use 
case.
   
   In theory, we could probably make your original code work with a trick 
like resetting the RecordBatchReader on repeated calls to collect/compute, but 
at the very least, documenting this in `to_arrow()` would be good. I can file 
a PR for the latter. @nealrichardson, do you have an opinion on leaving things 
as-is or implementing a fix here?

