wjones127 opened a new issue, #37504:
URL: https://github.com/apache/arrow/issues/37504

   ### Describe the enhancement requested
   
   Based on discussion in the [2023-08-30 Arrow community 
meeting](https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit#heading=h.k1ts4kvvl8jq).
 This is a continuation of https://github.com/apache/arrow/pull/35568 and 
https://github.com/apache/arrow/issues/33986.
   
   We'd like to have a protocol for sharing unmaterialized datasets that:
   
    1. Can be consumed as one or more streams of Arrow data
    2. Can have projections and filters pushed down to the scanner
   
   This would provide an extensible connection between scanners and query 
engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow 
datasets (Parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, 
Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end users 
employ their preferred query engine to load any supported dataset. From their 
perspective, usage might look like:
   
   ```python
   from deltalake import DeltaTable
   table = DeltaTable("path/to/table")
   
   import duckdb
   duckdb.sql("SELECT y FROM table WHERE x > 3")
   ```
   
   ## Shape of the protocol
   
   The overall shape would look roughly like:
   
   ```python
   from __future__ import annotations  # type hints below are illustrative and never evaluated
   from abc import ABC
   
   class AbstractArrowScannable(ABC):
       def __arrow_dataset__(self) -> AbstractArrowScanner:
           ...
   
   
   class AbstractArrowScanner(ABC):
       def get_schema(self) -> capsule[ArrowSchema]:
           ...
   
       def get_stream(
           self,
           columns: list[str],
           filter: SubstraitExpression,
       ) -> capsule[ArrowArrayStream]:
           ...
   
       def get_partitions(self, filter: SubstraitExpression) -> list[AbstractArrowScanner]:
           ...
   
   ```
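
To make the shape above concrete, here is a minimal pure-Python sketch of a producer. It is only an illustration under loud assumptions: `ToyTable`, `ToyScanner`, `Row`, and `RowFilter` are hypothetical names, rows are plain dicts standing in for `ArrowArrayStream` capsules, a list of column names stands in for the `ArrowSchema` capsule, and a Python callable stands in for a Substrait expression. Treating `get_partitions` as partition pruning is one possible reading of the proposal, not something the issue specifies.

```python
from __future__ import annotations

from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]
# Assumption: a plain Python predicate stands in for a Substrait expression.
RowFilter = Callable[[Row], bool]


class ToyScanner:
    """Hypothetical in-memory scanner that mirrors the proposed method names."""

    def __init__(self, partitions: List[List[Row]]) -> None:
        self._partitions = partitions

    def get_schema(self) -> List[str]:
        # The proposal returns an ArrowSchema capsule; column names stand in here.
        for part in self._partitions:
            if part:
                return list(part[0].keys())
        return []

    def get_stream(
        self, columns: List[str], filter: Optional[RowFilter] = None
    ) -> Iterator[Row]:
        # Filter and projection are applied inside the scanner (pushdown),
        # so the consumer only ever sees the rows and columns it asked for.
        for part in self._partitions:
            for row in part:
                if filter is None or filter(row):
                    yield {name: row[name] for name in columns}

    def get_partitions(self, filter: Optional[RowFilter] = None) -> List[ToyScanner]:
        # One scanner per partition that holds at least one matching row.
        return [
            ToyScanner([part])
            for part in self._partitions
            if any(filter is None or filter(row) for row in part)
        ]


class ToyTable:
    """Hypothetical dataset object exposing the proposed dunder method."""

    def __init__(self, partitions: List[List[Row]]) -> None:
        self._partitions = partitions

    def __arrow_dataset__(self) -> ToyScanner:
        return ToyScanner(self._partitions)


table = ToyTable([[{"x": 1, "y": "a"}, {"x": 5, "y": "b"}], [{"x": 7, "y": "c"}]])
scanner = table.__arrow_dataset__()
rows = list(scanner.get_stream(columns=["y"], filter=lambda r: r["x"] > 3))
pruned = scanner.get_partitions(filter=lambda r: r["x"] > 6)
```

Here `rows` holds only the `y` values of rows where `x > 3`, and `pruned` contains a single scanner, because only the second partition has a row matching `x > 6`.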
   
   Data and schema are returned as C Data Interface objects (see #35531). 
Predicates are passed as [Substrait extended 
expressions](https://substrait.io/expressions/extended_expression/). 
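
On the consuming side, a query engine might discover the protocol by duck typing and hand the projection and predicate to the scanner. The sketch below is an assumption-laden illustration rather than any engine's actual behavior: `scan` and `ListBackedDataset` are hypothetical names, the predicate is a Python callable standing in for a Substrait extended expression, and dict rows stand in for C Data Interface objects.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Row = Dict[str, Any]
# Assumption: a Python callable stands in for a Substrait extended expression.
Predicate = Callable[[Row], bool]


def scan(
    obj: Any, columns: List[str], predicate: Optional[Predicate] = None
) -> Iterator[Row]:
    """Hypothetical engine-side entry point for objects implementing the protocol."""
    get_scanner = getattr(obj, "__arrow_dataset__", None)
    if get_scanner is None:
        raise TypeError(f"{type(obj).__name__} does not implement __arrow_dataset__")
    scanner = get_scanner()
    # Validate the projection against the schema before starting the scan.
    schema = scanner.get_schema()
    missing = [name for name in columns if name not in schema]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    # Push both the projection and the predicate down to the scanner.
    return scanner.get_stream(columns, predicate)


class ListBackedDataset:
    """Minimal hypothetical producer that acts as its own scanner."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = list(rows)

    def __arrow_dataset__(self) -> "ListBackedDataset":
        return self

    def get_schema(self) -> List[str]:
        return list(self._rows[0]) if self._rows else []

    def get_stream(
        self, columns: List[str], filter: Optional[Predicate] = None
    ) -> Iterator[Row]:
        return (
            {name: row[name] for name in columns}
            for row in self._rows
            if filter is None or filter(row)
        )


dataset = ListBackedDataset([{"x": 1, "y": 10}, {"x": 4, "y": 20}])
result = list(scan(dataset, ["y"], lambda row: row["x"] > 3))
```

The engine never inspects the dataset's storage format; it relies only on the dunder method and the scanner methods, which is the decoupling the proposal aims for.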
   
   
   
   ### Component(s)
   
   Python


