thisisnic commented on issue #39090: URL: https://github.com/apache/arrow/issues/39090#issuecomment-4195152295
## Investigation Summary

The slowness is in the `to_r6<T>()` function in `r/src/arrow_cpp11.h:385-412`. For **each** record batch, it:

1. Allocates an external pointer wrapping the `shared_ptr` (line 388)
2. Looks up the R6 class symbol via `Rf_install` (line 389)
3. Checks if the class exists in the arrow namespace (lines 394-400)
4. Builds an R call expression `RecordBatch$new(xp)` (lines 404-405)
5. **Evaluates that call via `Rf_eval`** (line 408): this is the expensive part

With 4096 batches, that's 4096 `Rf_eval` calls, each invoking R's interpreter, R6 method dispatch, and environment creation.

### Why `read_table()` is fast

`read_table()` stays entirely in C++ via `reader->ToTable()` and returns a **single** R6 Table object, so there is no per-batch R interpreter overhead.

### This is how Arrow R binds C++ objects generally

The `to_r6` pattern is used throughout the package. The `to_r_list` function (which `batches()` uses) simply calls `to_r6` in a loop. This works fine when returning a handful of objects (columns, fields, etc.), but the overhead becomes visible when returning thousands.

### Potential fix

A bulk approach could reduce the 4096 `Rf_eval` calls to 1:

- C++ creates all external pointers in a list (fast, no R calls)
- C++ makes ONE call to an R helper that does `lapply(xp_list, RecordBatch$new)`

This would require changes to `to_r_list` in `arrow_cpp11.h` and adding an internal R helper function. However, it touches core binding infrastructure used throughout the package.

### Recommendation

Given the scope of changes required, and given that most `to_r_list` use cases return small lists, this may be acceptable as a known limitation. Users who need performance with many batches should use `read_table()` instead.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
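For concreteness, the R side of the bulk approach could look roughly like the sketch below. This is hypothetical, not existing package code: the helper name `new_r6_list` is invented here, and it assumes (as the current `to_r6` call pattern `RecordBatch$new(xp)` suggests) that each R6 generator's `$new` accepts a single external pointer.

```r
# Hypothetical internal helper (name is an assumption): C++ would build a
# plain list of external pointers with no R calls, then evaluate ONE call
# to this function instead of one Rf_eval per batch.
new_r6_list <- function(xp_list, class_name) {
  # Look up the R6 generator (e.g. RecordBatch) once, not once per element
  gen <- get(class_name, envir = asNamespace("arrow"))
  # R6 dispatch still happens per element, but the C++ <-> R boundary is
  # crossed only once for the whole batch list
  lapply(xp_list, gen$new)
}
```

On the C++ side, `to_r_list` would collect the external pointers into a list and replace its per-element `to_r6` calls with a single evaluation of this helper; the class lookup and namespace check would then also run once per list rather than once per batch.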
