wjones127 opened a new issue, #7598: URL: https://github.com/apache/iceberg/issues/7598
### Feature Request / Improvement

Hi, I've been looking into what we can do to make PyArrow Datasets extensible for various table formats and consumable by various compute engines (including DuckDB, Polars, DataFusion, and Dask). I've written up my observations here: https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?usp=sharing

## What this means for PyIceberg's API

Currently, integrating with engines like DuckDB means filters and projections have to be specified up front rather than pushed down from the query:

```python
con = table.scan(
    row_filter=GreaterThanOrEqual("trip_distance", 10.0),
    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
).to_duckdb(table_name="distant_taxi_trips")
```

Ideally, we should be able to export the table as a dataset, register it in DuckDB (or some other engine), and then let filters and projections be pushed down as the engine sees fit. The following would then perform equivalently to the above, but would be more user-friendly:

```python
dataset = table.to_pyarrow_dataset()
con.register(dataset, "distant_taxi_trips")
con.sql("""SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime
FROM distant_taxi_trips
WHERE trip_distance > 10.0""")
```

### Query engine

Other
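To illustrate the pushdown idea in isolation: with a dataset-style interface, the *engine* supplies the projection and predicate at scan time, instead of the user baking them into the scan call. Below is a minimal, self-contained sketch of that contract; `ToyDataset` and its `scanner()` method are hypothetical stand-ins, not PyArrow's or PyIceberg's actual API.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset that accepts filter/projection pushdown."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per record

    def scanner(self, columns=None, filter=None):
        # The consuming engine passes `columns` (projection) and `filter`
        # (a row predicate) here, derived from the user's SQL query,
        # rather than the user specifying them up front.
        for row in self.rows:
            if filter is None or filter(row):
                yield {c: row[c] for c in (columns or row)}


trips = ToyDataset([
    {"VendorID": 1, "trip_distance": 3.2},
    {"VendorID": 2, "trip_distance": 12.5},
])

# An engine translating "SELECT VendorID ... WHERE trip_distance > 10.0"
# would effectively call:
result = list(trips.scanner(columns=["VendorID"],
                            filter=lambda r: r["trip_distance"] > 10.0))
print(result)  # [{'VendorID': 2}]
```

The point of the proposal is that DuckDB, Polars, DataFusion, etc. would play the role of the caller of `scanner()`, so only the data that survives the pushed-down filter and projection ever leaves the dataset.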
