wjones127 opened a new issue, #37504: URL: https://github.com/apache/arrow/issues/37504
### Describe the enhancement requested

Based on discussion in the [2023-08-30 Arrow community meeting](https://docs.google.com/document/d/1xrji8fc6_24TVmKiHJB4ECX1Zy2sy2eRbBjpVJMnPmk/edit#heading=h.k1ts4kvvl8jq). This is a continuation of https://github.com/apache/arrow/pull/35568 and https://github.com/apache/arrow/issues/33986.

We'd like to have a protocol for sharing unmaterialized datasets that:

1. Can be consumed as one or more streams of Arrow data
2. Can have projections and filters pushed down to the scanner

This would provide an extensible connection between scanners and query engines. Data formats might include Iceberg, Delta Lake, Lance, and PyArrow datasets (Parquet, JSON, CSV). Query engines could include DuckDB, DataFusion, Polars, PyVelox, PySpark, Ray, and Dask. Such a connection would let end-users employ their preferred query engine to load any supported dataset. From their perspective, usage might look like:

```python
from deltalake import DeltaTable
table = DeltaTable("path/to/table")

import duckdb
duckdb.sql("SELECT y FROM table WHERE X > 3")
```

## Shape of the protocol

The overall shape would look roughly like:

```python
# Postponed evaluation lets the placeholder types below stay unresolved.
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import List


class AbstractArrowScannable(ABC):
    @abstractmethod
    def __arrow_dataset__(self) -> AbstractArrowScanner:
        ...


class AbstractArrowScanner(ABC):
    @abstractmethod
    def get_schema(self) -> capsule[ArrowSchema]:
        ...

    @abstractmethod
    def get_stream(
        self,
        columns: List[str],           # projection pushed down to the scanner
        filter: SubstraitExpression,  # predicate pushed down to the scanner
    ) -> capsule[ArrowArrayStream]:
        ...

    @abstractmethod
    def get_partitions(self, filter: SubstraitExpression) -> list[AbstractArrowScanner]:
        ...
```

Data and schema are returned as C Data Interface objects (see: 35531). Predicates are passed as [Substrait extended expressions](https://substrait.io/expressions/extended_expression/).

### Component(s)

Python
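To make the intended call flow concrete, here is a toy, pure-Python sketch of how a query engine might drive a scanner shaped like the one above. It is only an illustration under loud assumptions: `ToyDataset`, `ToyScanner`, and `scan` are hypothetical names, rows are plain dicts standing in for `capsule[ArrowArrayStream]`, a list of column names stands in for `capsule[ArrowSchema]`, and a Python callable stands in for a Substrait extended expression. No real PyCapsules or Substrait plans are involved.

```python
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, int]
# Assumption: a plain callable stands in for a Substrait extended expression.
Predicate = Callable[[Row], bool]


class ToyScanner:
    """Hypothetical in-memory scanner mirroring the proposed method names."""

    def __init__(self, fragments: List[List[Row]]) -> None:
        # Each fragment plays the role of one file or partition.
        self._fragments = fragments

    def get_schema(self) -> List[str]:
        # Stand-in for capsule[ArrowSchema]: column names of the first row found.
        for fragment in self._fragments:
            if fragment:
                return list(fragment[0].keys())
        return []

    def get_stream(
        self, columns: List[str], filter: Optional[Predicate] = None
    ) -> Iterator[Row]:
        # Stand-in for capsule[ArrowArrayStream]: a lazy iterator of rows,
        # with the filter and the projection applied inside the scanner.
        for fragment in self._fragments:
            for row in fragment:
                if filter is None or filter(row):
                    yield {name: row[name] for name in columns}

    def get_partitions(self, filter: Optional[Predicate] = None) -> List["ToyScanner"]:
        # One scanner per fragment that may hold matching rows. A real scanner
        # would prune with statistics or partition values, not by reading rows.
        return [
            ToyScanner([fragment])
            for fragment in self._fragments
            if filter is None or any(filter(row) for row in fragment)
        ]


class ToyDataset:
    """Hypothetical scannable: exposes a scanner via __arrow_dataset__."""

    def __init__(self, fragments: List[List[Row]]) -> None:
        self._fragments = fragments

    def __arrow_dataset__(self) -> ToyScanner:
        return ToyScanner(self._fragments)


def scan(scannable: ToyDataset, columns: List[str], predicate: Predicate) -> List[Row]:
    """What a consuming engine might do: split into partitions, then stream each."""
    scanner = scannable.__arrow_dataset__()
    rows: List[Row] = []
    for partition in scanner.get_partitions(predicate):
        rows.extend(partition.get_stream(columns, predicate))
    return rows


dataset = ToyDataset(
    [
        [{"x": 1, "y": 10}, {"x": 2, "y": 20}],
        [{"x": 4, "y": 40}, {"x": 5, "y": 50}],
    ]
)
# Roughly "SELECT y FROM table WHERE x > 3": only the second fragment is read.
result = scan(dataset, ["y"], lambda row: row["x"] > 3)
```

In this sketch the engine never sees the first fragment's rows, which is the point of letting the scanner apply both the filter and the column selection. In the actual proposal each partition's stream could also be consumed in parallel by the engine.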
