jorisvandenbossche commented on issue #33986: URL: https://github.com/apache/arrow/issues/33986#issuecomment-1542354763
Thanks for reviving this and the write-up, Will! (Coincidentally, I was finishing an earlier write-up yesterday about a protocol for just the C Data Interface (so without more dataset-like capabilities like predicate/projection pushdown), and while there is potential overlap, I opened the separate issue I was writing anyway: https://github.com/apache/arrow/issues/35531)

What is still not clear to me (also after reading the doc) is the exact scope or goal of a potential protocol (where exactly is the extension point?). Some possible ways I could interpret it:

1. An interface for _consuming_ data from a dataset-like object, without it having to be a `pyarrow.dataset.Dataset` (or Scanner) instance.
   * This is closer to "just" exposing an ArrowArrayStream, but with additional capabilities to inspect the schema and do predicate/projection pushdown (before actually consuming the stream).
2. An interface to _describe_ a dataset source (fragment file paths, filesystem, ..) such that any dataset implementation can read data from a source specified with this interface.
   * Would this essentially be like a Substrait ReadRel? (In terms of the information it would need to capture, maybe with additional things like fragment guarantees/statistics.)
3. An extension point (or ABC) _specifically for the pyarrow.dataset implementation_, such that you can define/implement a dataset source without having to extend Arrow Dataset at the C++ level, but still plug it into a pyarrow Dataset/Scanner. That way you can make use of some aspects of the Arrow implementation (for example, if you attach guarantees to your fragments, you can rely on the Arrow Dataset implementation to handle the predicate pushdown) or use consumers that already have support for pyarrow Datasets. This is basically trying to make Arrow C++ Datasets more easily extensible.

Using the example of deltalake to duckdb, the three options would look like:

1. The user creates a `deltalake.DeltaTable` object, which exposes a stream of data through some protocol. This table object is passed to a duckdb method, which no longer checks for a hardcoded pyarrow.Table/Dataset/Scanner, but checks whether the generic protocol method is available. From that method, it can get the equivalent of a pyarrow Scanner(filter, projection) -> RecordBatchReader -> C Stream (but without hardcoding the pyarrow APIs). The actual reading of the data is done fully by the deltalake implementation.
2. The user creates a `deltalake.DeltaTable` object, which is passed to duckdb. Duckdb gets the information from this object about what to read (which files to read from where), but then reads them itself. The actual reading here is done by a different library than the one that specified the source.
3. The user lets deltalake create a pyarrow Dataset object on top of a deltalake fragment/scanner implementation. This object is passed to duckdb, and duckdb uses its current pyarrow integration to consume this data. The actual reading of the data is done by both deltalake and pyarrow.dataset (Arrow coordinates reading the fragments and streaming the data, but the actual reading is done by the deltalake fragment implementation).

What is other people's understanding of what we are discussing / looking for? I suppose the two questions from https://github.com/apache/arrow/issues/33986#issuecomment-1416418936 overlap with this and essentially ask the same. But, for example, the google doc currently mentions "filesystems" as one of the challenges; if it's option 1 that we are interested in, I still don't fully understand how filesystems are involved (the filesystem (interaction) is fully defined by the producer (deltalake, or the user of deltalake that created the DeltaTable object), and once you are reading data from that source (duckdb), you don't have to be aware of the filesystem details?).
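To make option 1 concrete, here is a minimal sketch of what such a consuming protocol could look like in Python. Every name here (`DataSource`, `scan`, `ToyDeltaTable`, `consume`) is hypothetical and invented for illustration; this is not an existing pyarrow, deltalake, or duckdb API. The point is only the shape: the consumer duck-types against a protocol (schema inspection plus a scan with projection/predicate pushdown) instead of checking for concrete pyarrow classes, and the producer does all of the actual reading.

```python
from typing import Iterator, Optional, Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Hypothetical option-1 protocol: consume data, don't describe the source."""

    def schema(self) -> list:
        """Return the column names (stand-in for an Arrow schema)."""
        ...

    def scan(self, columns: Optional[list] = None,
             filter_expr=None) -> Iterator[dict]:
        """Return a stream of rows/batches after the producer has applied
        projection and predicate pushdown (stand-in for a RecordBatchReader
        / ArrowArrayStream)."""
        ...


class ToyDeltaTable:
    """A toy producer (playing the role of deltalake.DeltaTable)."""

    def __init__(self, rows):
        self._rows = rows

    def schema(self):
        return sorted(self._rows[0]) if self._rows else []

    def scan(self, columns=None, filter_expr=None):
        for row in self._rows:
            if filter_expr is not None and not filter_expr(row):
                continue  # predicate pushdown: producer skips the row
            if columns is not None:
                row = {c: row[c] for c in columns}  # projection pushdown
            yield row


def consume(source):
    """A toy consumer (playing the role of duckdb): it only checks for the
    generic protocol, never for a concrete pyarrow.Table/Dataset/Scanner."""
    if not isinstance(source, DataSource):
        raise TypeError("object does not implement the data source protocol")
    return list(source.scan(columns=["a"], filter_expr=lambda r: r["b"] > 1))
```

Because `DataSource` is `runtime_checkable`, the consumer's check is purely structural: any object with `schema` and `scan` methods passes, with no inheritance or pyarrow dependency required on either side.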
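For contrast, option 2 could be sketched as a plain description of the source that the consumer then reads itself. Again, all names here (`Fragment`, `SourceDescription`, `prune`) are hypothetical illustrations, not an existing API; the fields are roughly the information a Substrait ReadRel would carry, plus per-fragment guarantees that let the consumer skip fragments before reading any data.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """One unit of data to read, with optional column-value guarantees."""
    path: str
    guarantee: dict = field(default_factory=dict)  # e.g. {"year": 2023}


@dataclass
class SourceDescription:
    """Hypothetical option-2 payload: describes *what* to read, not *how*."""
    filesystem: str       # e.g. "local", "s3"
    file_format: str      # e.g. "parquet"
    fragments: list


def prune(desc: SourceDescription, predicate: dict) -> SourceDescription:
    """Keep only fragments whose guarantees could satisfy an equality
    predicate; fragments without a guarantee for a column are kept, since
    they cannot be ruled out without reading them."""
    kept = [f for f in desc.fragments
            if all(f.guarantee.get(col, val) == val
                   for col, val in predicate.items())]
    return SourceDescription(desc.filesystem, desc.file_format, kept)
```

Here the roles are inverted relative to option 1: the producer only hands over metadata, and the consumer (duckdb, in the example) is responsible for opening the filesystem and reading the files, which is why filesystem details matter much more for this option.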
