corleyma commented on issue #7598: URL: https://github.com/apache/iceberg/issues/7598#issuecomment-1546795989
I just wanted to second this as the best way to integrate with PyArrow. I've been doing essentially this approach using `scan.plan_files()`, but it's imperfect because PyIceberg does a lot of the important logic, such as reconciling schema evolution, after scan planning. That reconciliation happens in the [`project_table`](https://github.com/apache/iceberg/blob/b88951bbef5140e69aac9da5dc39e7e6eeb5100f/python/pyiceberg/io/pyarrow.py#L744) function (`to_duckdb` calls `to_arrow`, which calls `project_table`). `project_table` both generates the projection plan from the Iceberg schema and executes it, loading a PyArrow table into memory for every file in the scan plan and concatenating the results. That concatenated table is what ultimately gets handed to PyArrow or DuckDB, which is not great if, e.g., your subsequent queries could have benefited from further pushdown that would have limited the amount of data read.

So, @wjones127, ideally there would be some way to express this Iceberg schema-reconciliation logic directly in a PyArrow dataset, which would then become the logical source node passed to different engines to do their own querying/pushdown. Some time ago I looked at how feasible this would be with the PyArrow Dataset API, and I _think_ I concluded it wasn't possible yet, but my recollection is a little hazy.
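
For reference, here's a rough sketch of the `plan_files()`-based integration I mean. The catalog/table names and filter are illustrative, pulling `file_path` off each `FileScanTask` is my assumption about the task shape, and this is exactly where the approach is lossy: none of the reconciliation that `project_table` does (schema evolution, deletes, partition-derived columns) gets applied.

```python
import duckdb
import pyarrow.dataset as ds

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Illustrative names; substitute your own catalog/table/filter.
catalog = load_catalog("default")
table = catalog.load_table("db.events")
scan = table.scan(row_filter=GreaterThanOrEqual("event_ts", "2023-01-01"))

# plan_files() yields FileScanTasks; pull the data-file paths out of them.
# Note: nothing from project_table's schema reconciliation is applied here.
file_paths = [task.file.file_path for task in scan.plan_files()]

# Build a PyArrow dataset over those files. For an object store you'd pass
# an explicit `filesystem=`; local paths work as-is.
dataset = ds.dataset(file_paths, format="parquet")

# DuckDB can scan a registered PyArrow dataset and push projections and
# filters into it, instead of receiving one big pre-materialized Arrow table.
con = duckdb.connect()
con.register("events", dataset)
result = con.execute(
    "SELECT event_type, count(*) FROM events GROUP BY 1"
).fetch_arrow_table()
```

That registered dataset is the kind of logical source node I'd want to hand to engines; the missing piece is baking the per-file projection/reconciliation logic into the dataset itself rather than materializing everything up front.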
