corleyma commented on issue #7598: URL: https://github.com/apache/iceberg/issues/7598#issuecomment-1546795989
I just wanted to second this as the best way to integrate with PyArrow. I've been doing essentially this approach using `scan.plan_files()`, but it's imperfect because PyIceberg does a lot of the important logic, such as reconciling schema evolution, after scan planning. That reconciliation happens in the [`project_table`](https://github.com/apache/iceberg/blob/b88951bbef5140e69aac9da5dc39e7e6eeb5100f/python/pyiceberg/io/pyarrow.py#L744) function (`to_duckdb` calls `to_arrow`, which calls `project_table`). `project_table` both generates the projection plan from the Iceberg schema and executes it, loading a PyArrow table into memory for every file in the scan plan and concatenating the results. That concatenated table is what ultimately gets handed to PyArrow or DuckDB, which is not great if, e.g., your subsequent queries could have benefited from further pushdown that would have limited the amount of data read.

So, @wjones127, ideally there would be some way to express this Iceberg schema-reconciliation logic directly in a PyArrow dataset, which would then become the logical source node passed to different engines to do their own querying/pushdown. Some time ago I looked at how feasible this would be with the PyArrow Dataset API, and I _think_ I concluded it wasn't possible yet, but my recollection is a little hazy.
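
For reference, here's a rough sketch of the `plan_files()`-based integration I mean. The catalog/table names and filter are illustrative, pulling `file_path` off each `FileScanTask` is my assumption about the task shape, and this is exactly where the approach is lossy: none of the reconciliation that `project_table` does (schema evolution, deletes, partition-derived columns) gets applied.

```python
import duckdb
import pyarrow.dataset as ds

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Illustrative names; substitute your own catalog/table/filter.
catalog = load_catalog("default")
table = catalog.load_table("db.events")
scan = table.scan(row_filter=GreaterThanOrEqual("event_ts", "2023-01-01"))

# plan_files() yields FileScanTasks; pull the data-file paths out of them.
# Note: nothing from project_table's schema reconciliation is applied here.
file_paths = [task.file.file_path for task in scan.plan_files()]

# Build a PyArrow dataset over those files. For an object store you'd pass
# an explicit `filesystem=`; local paths work as-is.
dataset = ds.dataset(file_paths, format="parquet")

# DuckDB can scan a registered PyArrow dataset and push projections and
# filters into it, instead of receiving one big pre-materialized Arrow table.
con = duckdb.connect()
con.register("events", dataset)
result = con.execute(
    "SELECT event_type, count(*) FROM events GROUP BY 1"
).fetch_arrow_table()
```

That registered dataset is the kind of logical source node I'd want to hand to engines; the missing piece is baking the per-file projection/reconciliation logic into the dataset itself rather than materializing everything up front.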
