paleolimbot commented on pull request #12323: URL: https://github.com/apache/arrow/pull/12323#issuecomment-1055461092
Thanks!

> What's the timing on this?

There is no particular rush on `read_feather()` and `read_csv_arrow()` working with R connections. It doesn't have to be solved for this PR, either, although if this PR is merged it would be best to fix before the next CRAN release.

> We can probably come up with a "limit all CPU and I/O tasks to the R thread" solution more easily than a "use the CPU thread pool for CPU tasks but limit all I/O tasks to the R thread" but the latter should probably be possible.

I'm still wrapping my head around the specifics here, but because they might be related I'll list the "calling the R thread" possibilities I've run into recently in case any of them makes one of those options more obvious to pursue.

- This PR, when a user wants to use some Arrow machinery but needs to implement the `InputStream` or `OutputStream` as R functions because for whatever reason the filesystem/input stream type isn't implemented in Arrow C++ or the R bindings
- A user has a `RecordBatchReader` where calling the get next batch method is an R function. I haven't had time to look into it properly but this crashes every time I've tried to put it into the query engine (works for `read_table()`, though). Possibly related is a `RecordBatchReader.from_batches()` that was imported from Python via the C interface, which also crashes when put into the query engine (but not `read_table()`).
- An extension type implemented in R that has a custom `ExtensionEquals()` method (just starting this in #12467).
- A compute function that wraps an R function (e.g., for things like geospatial operators whose external dependencies are impractical or impossible to include in the arrow R package)

From the R end, I know there is a way to request the evaluation of something on the main thread from elsewhere; however, there needs to be an event loop on the main thread checking for tasks for that to work.
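The "event loop on the main thread checking for tasks" idea can be sketched roughly as below. This is a minimal C++ illustration under stated assumptions, not code from Arrow or from any R package: `MainThreadQueue`, `Submit()`, `RunLoop()`, `Stop()`, and `CallbackRanOnCallingThread()` are hypothetical names, and the `int`-returning callback merely stands in for a call into an R function.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper (not an Arrow or R API): a queue of tasks that only
// the thread calling RunLoop() is allowed to execute.
class MainThreadQueue {
 public:
  // Callable from any thread: enqueue `fn` and return a future for its result.
  std::future<int> Submit(std::function<int()> fn) {
    std::packaged_task<int()> task(std::move(fn));
    std::future<int> result = task.get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
    return result;
  }

  // Call from the "main" thread: run queued tasks until Stop() has been
  // called and the queue is empty.
  void RunLoop() {
    for (;;) {
      std::packaged_task<int()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopped and drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // executes on the thread that called RunLoop()
    }
  }

  // Callable from any thread: ask the loop to exit once the queue is drained.
  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::packaged_task<int()>> tasks_;
  bool stopped_ = false;
};

// Demo: a worker thread (standing in for an I/O or CPU pool thread) asks the
// calling thread to run a callback, then blocks until the result is ready.
// Returns true if the callback ran on the calling thread and returned 42.
bool CallbackRanOnCallingThread() {
  MainThreadQueue queue;
  const std::thread::id main_id = std::this_thread::get_id();
  std::thread::id ran_on;
  int value = 0;

  std::thread worker([&] {
    std::future<int> result = queue.Submit([&] {
      ran_on = std::this_thread::get_id();
      return 42;
    });
    value = result.get();  // waits for the main loop to run the callback
    queue.Stop();
  });

  queue.RunLoop();  // the "event loop" on the main thread
  worker.join();
  return ran_on == main_id && value == 42;
}
```

The worker never touches the callback itself; it only waits on the future, which is why the main thread has to be sitting in `RunLoop()` (or polling the queue some other way) for the request to make progress.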
I don't know much about that mechanism, but I do know it has been used elsewhere for packages like Shiny and plumber that accept HTTP requests and funnel them to R functions.

> Although, we could probably address that particular performance impact if the underlying technology has support for an asynchronous API (as it seems that R's curl package does)

In my mind, supporting R connections is more about providing a (possibly slow) workaround for things that Arrow C++ or the R bindings can't do yet (e.g., URLs). I do know that the async API for curl from the R end is along the lines of `open_async(url, function(chunk, is_last_chunk))`. R connections are a pain, and if there are more use-cases along these lines it might be worth investing in some C struct definitions where it's clear that callable members must be thread safe.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
