Fokko commented on issue #7598:
URL: https://github.com/apache/iceberg/issues/7598#issuecomment-1548031781

   @wjones127 Thanks for raising this and doing all the work. I've added some 
comments to the Google doc and also the pull request that describes the 
interface.
   
   @corleyma has a good point here. I think the main reason why Arrow doesn't 
have an Iceberg implementation today is that it is quite a lot of work to get 
the details right, and those details are what make Iceberg so performant.
   
   As https://github.com/apache/arrow/issues/33986 suggests, I think it would be 
great for PyIceberg to be able to produce and consume Substrait plans. It could 
consume a high-level plan such as `SELECT * FROM s3://bucket/path/json@snapshot-id 
WHERE date_field >= 2023-01-01` and produce a low-level plan that tells Arrow 
which files to read and which projection to apply. It will become complex, 
though. For example, how would we express [positional 
deletes](https://github.com/apache/iceberg/pull/6775)? It can be done, but I 
assume it would need some changes to Substrait.
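
   To make the planning step above a bit more concrete, here is a rough sketch 
in plain Python. It is not PyIceberg or Substrait code: `DataFile`, `ScanTask`, 
`plan_scan`, and the file names are all hypothetical, and I'm assuming each 
file carries min/max bounds for `date_field` plus the row positions removed by 
positional deletes.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set


@dataclass
class DataFile:
    # Hypothetical stand-in for a manifest entry; not a PyIceberg class.
    path: str
    min_date: date  # lower bound of date_field in this file
    max_date: date  # upper bound of date_field in this file
    # Row positions removed by positional delete files (assumed pre-resolved).
    deleted_positions: Set[int] = field(default_factory=set)


@dataclass
class ScanTask:
    # Hypothetical "low-level plan" entry handed to the reader.
    path: str
    columns: List[str]
    skip_positions: List[int]


def plan_scan(files: List[DataFile], columns: List[str], min_date: date) -> List[ScanTask]:
    """Turn `WHERE date_field >= min_date` into per-file read tasks."""
    tasks = []
    for f in files:
        # Prune files whose upper bound rules out any matching row.
        if f.max_date < min_date:
            continue
        tasks.append(ScanTask(f.path, list(columns), sorted(f.deleted_positions)))
    return tasks


files = [
    DataFile("s3://bucket/path/data/a.parquet", date(2022, 1, 1), date(2022, 12, 31)),
    DataFile("s3://bucket/path/data/b.parquet", date(2022, 12, 15), date(2023, 2, 1), {3, 7}),
]
tasks = plan_scan(files, ["id", "date_field"], date(2023, 1, 1))
```

   In this sketch the first file is pruned entirely, while the second one is 
kept with its deleted positions attached. The reader would still have to apply 
the row-level filter, because a surviving file can contain rows from before 
the cutoff. The `skip_positions` field is the part I would not know how to 
express in a plan today.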
   
   That said, I'm all in to see if we can integrate PyIceberg into Arrow. I 
agree that the dataset is the ideal situation. If there is anything that you 
want me to try, please let me know. I'm happy to help and see if we can make 
this work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

