wjones127 opened a new issue, #7598:
URL: https://github.com/apache/iceberg/issues/7598

   ### Feature Request / Improvement
   
   Hi, I've been looking into what we can do to make PyArrow Datasets 
extensible for various table formats and consumable by various 
compute engines (including DuckDB, Polars, DataFusion, and Dask). I've written 
up my observations here: 
https://docs.google.com/document/d/1r56nt5Un2E7yPrZO9YPknBN4EDtptpx-tqOZReHvq1U/edit?usp=sharing
   
   ## What this means for PyIceberg's API
   
   Currently, integrating with engines like DuckDB means filters and 
projections have to be specified up front, rather than pushed down from the 
query:
   
   ```python
   con = table.scan(
       row_filter=GreaterThanOrEqual("trip_distance", 10.0),
       selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
   ).to_duckdb(table_name="distant_taxi_trips")
   ```
   
   Ideally, we should be able to export the table as a dataset, register it in 
DuckDB (or some other engine), and let filters and projections be pushed 
down as the engine sees fit. The following would then perform equivalently to 
the above, while being more user-friendly:
   
   ```python
   dataset = table.to_pyarrow_dataset()
   con.register(dataset, "distant_taxi_trips")
   con.sql("""SELECT VendorID, tpep_pickup_datetime, tpep_dropoff_datetime
       FROM distant_taxi_trips
       WHERE trip_distance > 10.0""")
   ```
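   
   For context, the pushdown mechanism this relies on already exists in plain 
`pyarrow.dataset`: a `Dataset` accepts the filter and projection at scan time 
rather than at construction time, which is exactly what lets a consuming engine 
push them down itself. A minimal sketch of that behavior (the taxi-style column 
names and the temporary Parquet file are illustrative only, not PyIceberg API):
   
   ```python
   import os
   import tempfile
   
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Write a tiny Parquet file so we have something to build a dataset over.
   tmp = tempfile.mkdtemp()
   table = pa.table({
       "VendorID": [1, 2, 1, 2],
       "trip_distance": [3.5, 12.0, 25.1, 0.8],
   })
   pq.write_table(table, os.path.join(tmp, "part-0.parquet"))
   
   dataset = ds.dataset(tmp, format="parquet")
   
   # Filter and projection are supplied at scan time, not when the dataset
   # is created -- this is the hook an engine uses for pushdown.
   result = dataset.to_table(
       columns=["VendorID"],
       filter=ds.field("trip_distance") >= 10.0,
   )
   print(result.num_rows)  # 2 rows match the predicate
   ```
   
   If `table.to_pyarrow_dataset()` returned such a dataset backed by Iceberg 
metadata, the engine's own planner could supply these arguments instead of the 
user.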
   
   
   ### Query engine
   
   Other


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

