jorisvandenbossche commented on issue #33986:
URL: https://github.com/apache/arrow/issues/33986#issuecomment-1542354763

   Thanks for reviving this and the write-up, Will!  
   (Coincidentally, I was finishing up an earlier write-up yesterday about a 
protocol for just the C Data Interface (so without more dataset-like 
capabilities such as predicate/projection pushdown); while there is potential 
overlap, I opened the separate issue I was writing anyway: 
https://github.com/apache/arrow/issues/35531)
   
   What is still not clear to me (also after reading the doc) is the exact 
scope or goal of a potential protocol (where exactly is the extension point?). 
Some possible ways I could interpret it:
   
   1. An interface for _consuming_ data from a dataset-like object, without 
having to be a `pyarrow.dataset.Dataset` (or Scanner) instance.
      * This is closer to "just" exposing an ArrowArrayStream, but with 
additional capabilities to inspect the schema and to do predicate/projection 
pushdown before actually consuming the stream
   2. An interface to _describe_ a dataset source (fragment file paths, 
filesystem, ..) such that any dataset implementation can read data from a 
source specified with this interface
      * Would this essentially be like a Substrait ReadRel? (in terms of the 
information it would need to capture, maybe with additional things like 
fragment guarantees/statistics)
   3. An extension point (or ABC) _specifically for the pyarrow.dataset 
implementation_ such that you can define/implement a dataset source without 
having to extend Arrow Dataset at the C++ level, but still plug it into a 
pyarrow Dataset/Scanner, such that you can make use of some aspects of the 
arrow implementation (for example, if you attach guarantees to your fragments, 
then you can rely on the Arrow Dataset implementation to handle the predicate 
pushdown) or use consumers that already have support for pyarrow Datasets. This 
is basically trying to make Arrow C++ Datasets more easily extensible.
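To make interpretation 1 more concrete, here is a minimal sketch of what such a "consumable dataset" protocol could look like. All names (`DatasetProtocol`, `scanner`, `InMemoryTable`) are invented for illustration, and plain dicts/iterators stand in for what would really be an Arrow schema and an ArrowArrayStream; nothing here is an agreed-upon Arrow API.

```python
from abc import ABC, abstractmethod
from typing import Iterator, Optional, Sequence

class DatasetProtocol(ABC):
    """Hypothetical protocol for interpretation 1: a consumer can inspect
    the schema and request projection/predicate pushdown, then consume a
    stream, without the object being a pyarrow.dataset.Dataset."""

    @abstractmethod
    def schema(self) -> dict:
        """Expose the schema so consumers can plan before scanning."""

    @abstractmethod
    def scanner(
        self,
        columns: Optional[Sequence[str]] = None,
    ) -> Iterator[dict]:
        """Return a stream of batches after applying the projection.
        In a real protocol this would export the Arrow C Stream
        interface rather than yield Python dicts."""

class InMemoryTable(DatasetProtocol):
    """Toy producer: all actual reading (and pushdown) is done by the
    producer itself, mirroring how deltalake would do its own I/O."""

    def __init__(self, columns: dict):
        self._columns = columns

    def schema(self) -> dict:
        return {
            name: type(values[0]).__name__
            for name, values in self._columns.items()
        }

    def scanner(self, columns=None):
        names = list(columns) if columns else list(self._columns)
        # Apply the projection, then stream row by row.
        for row in zip(*(self._columns[n] for n in names)):
            yield dict(zip(names, row))
```

The key design point is that pushdown parameters are passed *before* the stream is produced, so the producer can avoid reading data the consumer never asked for.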
   
   
   Using the example of deltalake to duckdb, the three options would look like:
   
   1. The user creates a `deltalake.DeltaTable` object, which exposes a stream 
of data through some protocol. This table object is passed to a duckdb method, 
which no longer hardcodes a check for a pyarrow.Table/Dataset/Scanner, but 
instead checks whether the generic protocol method is available. From that method, it 
can basically get the equivalent of a pyarrow Scanner(filter, projection) -> 
RecordBatchReader -> C Stream (but without hardcoding for the pyarrow APIs). 
The actual reading of the data itself is fully done by the deltalake 
implementation.
   2. The user creates a `deltalake.DeltaTable` object, which is passed to 
duckdb. Duckdb gets the information from this object about what to read (which 
files to read from where), but then reads the data itself. The actual reading 
here is done by a different library than the one that specified the source.
   3. The user lets deltalake create a pyarrow Dataset object on top of a 
deltalake fragment/scanner implementation. This object is passed to duckdb, and 
duckdb uses its current pyarrow integration to consume this data. The actual 
reading of the data is done by both deltalake and pyarrow.dataset (Arrow 
coordinating things for reading fragments and streaming that data, but the 
actual reading is done by the deltalake fragment implementation)
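The consumer side of walkthrough 1 above can also be sketched briefly: instead of hardcoded isinstance checks against pyarrow classes, a consumer (standing in for duckdb here) only looks for the protocol method. The method name `__arrow_scanner__` and the `FakeDeltaTable` class are purely illustrative assumptions, not an existing API of duckdb or deltalake.

```python
def consume(source, columns=None):
    """Consumer that duck-types on a hypothetical protocol method
    instead of checking for pyarrow.Table/Dataset/Scanner."""
    scan = getattr(source, "__arrow_scanner__", None)
    if scan is None:
        raise TypeError("object does not expose the scan protocol")
    # The producer does all actual reading; the consumer just drains
    # the resulting stream (in reality, an ArrowArrayStream).
    return list(scan(columns=columns))

class FakeDeltaTable:
    """Toy producer exposing the invented protocol method."""

    def __arrow_scanner__(self, columns=None):
        rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
        if columns:
            # Projection pushdown handled by the producer.
            rows = [{k: row[k] for k in columns} for row in rows]
        return iter(rows)
```

Under this reading, duckdb never needs to know whether the source is deltalake, pyarrow, or anything else, only that the protocol method exists.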
   
   What is other people's understanding of what we are discussing / looking 
for?
   
   I suppose the two questions from 
https://github.com/apache/arrow/issues/33986#issuecomment-1416418936 overlap 
with this and essentially ask the same thing. 
   But, for example, the google doc currently mentions "filesystems" as one of 
the challenges. If it's option 1 that we are interested in, I still don't 
fully understand how filesystems are involved: the filesystem interaction is 
fully defined by the producer (deltalake, or the user of deltalake who created 
the deltalake Table object), and once you are reading data from that source 
(duckdb), you don't have to be aware of the filesystem details?
   
   

