wjones127 commented on issue #33986: URL: https://github.com/apache/arrow/issues/33986#issuecomment-1416418936
The interface is for scanning the dataset, which is _after_ the filesystems have been passed. So it's a separate concern.

Yet it is still relevant to "how do I extend dataset" because your scanning implementation needs to use some filesystem. And that means the user needs to configure and pass one in. The easiest for users is to take fsspec / PyArrow filesystems as Python interfaces, although performance may be impacted by the GIL. (I have started, but not finished, an implementation of `ObjectStore` in Rust that wraps fsspec filesystems [here](https://github.com/delta-io/delta-rs/pull/900).) Or you can allow configuring a native filesystem, but then it's another API users have to learn.

> wants to take advantage of the API of the former to get access to things like DuckDB integration

IMO the current DuckDB integration feels a little silly. It manipulates Python objects until it can get an RBR (`RecordBatchReader`) and then exports that through the C data interface. The same code is duplicated in the R package, except it manipulates R objects. And nothing is available in other languages. So part of me thinks it would be cleaner to replace that integration with this kind of C API, but that's for the DuckDB devs to decide :)

So there's sort of two questions:

* How do I *extend* Dataset from a separate package, particularly if implemented in Rust? This is where the filesystem API / configuration stuff comes in.
* How do I *consume* Dataset from a separate package? This is where the DuckDB integration comes in.

It's unclear to me rn whether we just want to create a C API you can use _instead_ of Dataset to solve these, or make a C API _on_ Dataset to solve them. The former is less complicated for sure, but I'm not sure we want to sidestep the Dataset API like that.
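To make the extend / consume split a bit more concrete, here is a rough, stdlib-only Python sketch. Everything in it is hypothetical and made up for illustration (`Scannable`, `InMemoryFS`, `CsvDataset`, `count_rows`); none of these are Arrow, fsspec, or DuckDB APIs, and plain lists of dicts stand in for record batches. The only point is that the filesystem is configured when the dataset is constructed, while the scanning interface a consumer sees never mentions it.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional, Protocol

Batch = List[Dict[str, str]]  # stand-in for an Arrow record batch


class Scannable(Protocol):
    """Hypothetical consumer-facing interface: no filesystem in sight."""

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[Batch]: ...


class InMemoryFS:
    """Hypothetical stand-in for an fsspec / PyArrow filesystem object."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = files

    def open(self, path: str) -> io.StringIO:
        return io.StringIO(self._files[path])


class CsvDataset:
    """'Extend' side: the filesystem is passed in here, before any scan."""

    def __init__(self, fs: InMemoryFS, paths: List[str]) -> None:
        self._fs = fs
        self._paths = paths

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[Batch]:
        # Yield one batch per file, with optional column projection.
        for path in self._paths:
            with self._fs.open(path) as f:
                rows = list(csv.DictReader(f))
            if columns is not None:
                rows = [{c: row[c] for c in columns} for row in rows]
            yield rows


def count_rows(dataset: Scannable) -> int:
    """'Consume' side: depends only on scan(), not on how files are read."""
    return sum(len(batch) for batch in dataset.scan(columns=["id"]))


fs = InMemoryFS({"a.csv": "id,name\n1,x\n2,y\n", "b.csv": "id,name\n3,z\n"})
dataset = CsvDataset(fs, ["a.csv", "b.csv"])
total = count_rows(dataset)
```

In this toy version, `count_rows` plays the role a consumer such as DuckDB would play, and `CsvDataset` plays the role of an external dataset implementation. A C API would draw the same boundary across languages instead of inside one Python process.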
