[GitHub] [arrow] wjones127 commented on issue #33986: [Python][Rust] Create extension point in python for Dataset/Scanner

via GitHub Fri, 03 Feb 2023 13:25:27 -0800


wjones127 commented on issue #33986:
URL: https://github.com/apache/arrow/issues/33986#issuecomment-1416418936


   The interface is for scanning the dataset, which is _after_ the filesystems 
have been passed. So it's a separate concern. Yet it is still relevant to "how 
do I extend dataset" because your scanning implementation needs to use some 
filesystem. And that means the user needs to configure and pass one in.
   
   The easiest option for users is to accept fsspec / PyArrow filesystems as 
Python interfaces, although performance may be impacted by the GIL. (I have 
started, but not finished, an implementation of `ObjectStore` in Rust that 
wraps fsspec filesystems 
[here](https://github.com/delta-io/delta-rs/pull/900).) Or you can allow 
configuring a native filesystem, but then it's another API users have to 
learn. 
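   
   To make the first option concrete, here is a minimal sketch of a scanner 
that receives its filesystem from the user instead of creating one. Everything 
in it is hypothetical and for illustration only: `FileSystemLike`, 
`InMemoryFileSystem` and `LineScanner` are not part of PyArrow or fsspec. The 
only assumption about the real libraries is that the filesystem object exposes 
`open(path, mode)`, as fsspec filesystems do.

```python
import io
from typing import BinaryIO, Dict, Iterator, List, Protocol


class FileSystemLike(Protocol):
    """Hypothetical minimal interface: only the `open` call the scanner needs."""

    def open(self, path: str, mode: str = "rb") -> BinaryIO: ...


class InMemoryFileSystem:
    """Stand-in for a user-configured filesystem, for demonstration only."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def open(self, path: str, mode: str = "rb") -> BinaryIO:
        if mode != "rb":
            raise ValueError("this sketch only supports binary reads")
        return io.BytesIO(self._files[path])


class LineScanner:
    """Hypothetical scanner: the filesystem is injected, not created here."""

    def __init__(self, paths: List[str], filesystem: FileSystemLike) -> None:
        self._paths = paths
        self._fs = filesystem

    def scan(self) -> Iterator[bytes]:
        # Every read goes through the user-supplied filesystem object.
        for path in self._paths:
            with self._fs.open(path, "rb") as f:
                for line in f:
                    yield line.rstrip(b"\n")


fs = InMemoryFileSystem({"part-0": b"a\nb\n", "part-1": b"c\n"})
rows = list(LineScanner(["part-0", "part-1"], fs).scan())
```

   Because each read calls back into a Python object, this is also where the 
GIL cost mentioned above would show up.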
   
   > wants to take advantage of the API of the former to get access to things 
like DuckDB integration
   
   IMO the current DuckDB integration feels a little silly. It manipulates 
Python objects until it can get a RecordBatchReader (RBR) and then exports 
that through the C data interface. The same code is duplicated in the R 
package, except it manipulates R objects. And nothing is available in other 
languages. So part of me thinks it would be cleaner to replace that 
integration with this kind of C API, but that's for the DuckDB devs to 
decide :)
   
   So there are sort of two questions:
   
    * How do I *extend* Dataset from a separate package, particularly if 
implemented in Rust? This is where the filesystem API / configuration stuff 
comes in.
    * How do I *consume* Dataset from a separate package? This is where the 
DuckDB integration comes in.
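   
   As a rough illustration of the two roles, here is a pure-Python sketch. It 
is not the proposed C API and not any existing PyArrow interface: 
`ScannerLike`, `ListBackedDataset` and `sum_column` are hypothetical names, and 
plain dicts of lists stand in for Arrow record batches.

```python
from typing import Dict, Iterator, List, Optional, Protocol

Batch = Dict[str, List[int]]  # toy stand-in for an Arrow record batch


class ScannerLike(Protocol):
    """Hypothetical extension point shared by producers and consumers."""

    def column_names(self) -> List[str]: ...

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[Batch]: ...


class ListBackedDataset:
    """'Extend' side: a separate package supplies its own implementation."""

    def __init__(self, batches: List[Batch]) -> None:
        self._batches = batches

    def column_names(self) -> List[str]:
        return list(self._batches[0]) if self._batches else []

    def scan(self, columns: Optional[List[str]] = None) -> Iterator[Batch]:
        for batch in self._batches:
            # Column projection: only return what the consumer asked for.
            wanted = columns if columns is not None else list(batch)
            yield {name: batch[name] for name in wanted}


def sum_column(source: ScannerLike, column: str) -> int:
    """'Consume' side: an engine that relies only on the protocol."""
    return sum(sum(batch[column]) for batch in source.scan(columns=[column]))


dataset = ListBackedDataset([{"x": [1, 2], "y": [10, 20]}, {"x": [3], "y": [30]}])
total = sum_column(dataset, "x")
```

   The consumer never imports the producer's package; it only depends on the 
shared interface, which is the property either flavor of C API would need to 
provide across languages.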
   
   It's unclear to me right now whether we just want to create a C API you 
can use _instead_ of Dataset to solve these, or make a C API _on_ Dataset to 
solve these. The former is less complicated for sure, but I'm not sure we want 
to sidestep the Dataset API like that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
