thisisnic commented on issue #39090: URL: https://github.com/apache/arrow/issues/39090#issuecomment-4195152295
## Investigation Summary

The slowness is in the `to_r6<T>()` function in `r/src/arrow_cpp11.h:385-412`. For **each** record batch, it:

1. Allocates an external pointer wrapping the `shared_ptr` (line 388)
2. Looks up the R6 class symbol via `Rf_install` (line 389)
3. Checks if the class exists in the arrow namespace (lines 394-400)
4. Builds an R call expression `RecordBatch$new(xp)` (lines 404-405)
5. **Evaluates that call via `Rf_eval`** (line 408): this is the expensive part

With 4096 batches, that's 4096 `Rf_eval` calls, each invoking R's interpreter, R6 method dispatch, and environment creation.

### Why `read_table()` is fast

`read_table()` stays entirely in C++ via `reader->ToTable()` and returns a **single** R6 Table object, so there is no per-batch R interpreter overhead.

### This is how Arrow R binds C++ objects generally

The `to_r6` pattern is used throughout the package. The `to_r_list` function (which `batches()` uses) simply calls `to_r6` in a loop. This works fine when returning a handful of objects (columns, fields, etc.), but the overhead becomes visible when returning thousands.

### Potential fix

A bulk approach could reduce the 4096 `Rf_eval` calls to 1:

- C++ creates all external pointers in a list (fast, no R calls)
- C++ makes ONE call to an R helper that does `lapply(xp_list, RecordBatch$new)`

This would require changes to `to_r_list` in `arrow_cpp11.h` and adding an internal R helper function. However, it touches core binding infrastructure used throughout the package.

### Recommendation

Given the scope of changes required, and given that most `to_r_list` use cases return small lists, this may be acceptable as a known limitation. Users who need performance with many batches should use `read_table()` instead.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
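For concreteness, the R side of the bulk approach could look roughly like the sketch below. This is hypothetical, not existing package code: the helper name `new_r6_list` is invented here, and it assumes (as the current `to_r6` call pattern `RecordBatch$new(xp)` suggests) that each R6 generator's `$new` accepts a single external pointer.

```r
# Hypothetical internal helper (name is an assumption): C++ would build a
# plain list of external pointers with no R calls, then evaluate ONE call
# to this function instead of one Rf_eval per batch.
new_r6_list <- function(xp_list, class_name) {
  # Look up the R6 generator (e.g. RecordBatch) once, not once per element
  gen <- get(class_name, envir = asNamespace("arrow"))
  # R6 dispatch still happens per element, but the C++ <-> R boundary is
  # crossed only once for the whole batch list
  lapply(xp_list, gen$new)
}
```

On the C++ side, `to_r_list` would collect the external pointers into a list and replace its per-element `to_r6` calls with a single evaluation of this helper; the class lookup and namespace check would then also run once per list rather than once per batch.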
