corleyma commented on issue #7598:
URL: https://github.com/apache/iceberg/issues/7598#issuecomment-1546795989

   I just wanted to second this as the best way to integrate with PyArrow; I've been using essentially this approach via `scan.plan_files()`, but it's imperfect because PyIceberg does a lot of the important work of reconciling schema evolution, etc., after scan planning.
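   For context, here is a minimal sketch of the kind of integration I mean (the catalog and table names are made up, and it assumes a Parquet-only table whose data files are directly readable by PyArrow, e.g. local paths; object-store paths would also need a `filesystem` argument):

```python
import pyarrow.dataset as ds
from pyiceberg.catalog import load_catalog

# Hypothetical catalog and table names, purely for illustration.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Let PyIceberg do the scan planning (optionally with a row_filter for
# partition/metrics pruning)...
scan = table.scan()

# ...but hand the planned data files straight to a PyArrow dataset, so
# downstream engines can do their own projection and pushdown.
# Note: this skips the schema-evolution reconciliation that
# project_table would normally apply, which is exactly the gap
# described below.
files = [task.file.file_path for task in scan.plan_files()]
dataset = ds.dataset(files, format="parquet")
```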
   
   Specifically, the reconciliation logic happens in the [project_table](https://github.com/apache/iceberg/blob/b88951bbef5140e69aac9da5dc39e7e6eeb5100f/python/pyiceberg/io/pyarrow.py#L744) function (`to_duckdb` calls `to_arrow`, which calls `project_table`). `project_table` both generates the projection plan from the Iceberg schema and executes it, loading a PyArrow table into memory for every file in the scan plan and concatenating the results. That concatenated table is what ultimately gets handed to PyArrow or DuckDB, which is not great if, e.g., your subsequent queries could have benefited from further pushdown that would have limited the amount of data read.
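   In rough pseudocode, the flow is something like the following (a paraphrase of the linked function to show the shape of the problem, not the actual implementation):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def project_table_sketch(tasks, projected_schema):
    """Rough paraphrase of project_table: eagerly read every planned
    data file, apply the Iceberg projection, and concatenate everything
    into one in-memory table."""
    pieces = []
    for task in tasks:
        # Reads the whole file into memory. The real code also
        # reconciles each file against `projected_schema` here
        # (field-ID based projection, renamed columns, type
        # promotions), which is why the schemas line up for concat.
        pieces.append(pq.read_table(task.file.file_path))
    # This single concatenated table is what to_arrow()/to_duckdb()
    # ultimately hand off to the engine.
    return pa.concat_tables(pieces)
```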
   
   So, @wjones127, ideally there'd be some way to express this Iceberg schema reconciliation logic directly in a PyArrow dataset, which could then serve as the logical source node passed to different engines for their own querying/pushdown. Some time ago I looked at how feasible this would be with the PyArrow `Dataset` API, and I _think_ I concluded it wasn't possible yet, but my recollection is a little hazy.
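   The closest approximation I can think of is passing a unified Arrow schema to the dataset, so that columns missing from older files come back as nulls at scan time; but if I remember right, that matching is by column name only, so field-ID based renames and type promotions from Iceberg's schema evolution still can't be expressed there. A sketch, reusing `table` and `files` from above and assuming PyIceberg's `schema_to_pyarrow` helper:

```python
import pyarrow.dataset as ds
from pyiceberg.io.pyarrow import schema_to_pyarrow

# Build the dataset against the table's current (unified) schema.
# Columns added by schema evolution should read back as nulls for
# older files, but renamed or promoted columns are not reconciled the
# way Iceberg's field-ID based projection would do it.
unified = schema_to_pyarrow(table.schema())
dataset = ds.dataset(files, schema=unified, format="parquet")
```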

