colinmarc commented on issue #1797:
URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3458317766

   Hi, thanks for raising this. We maintain a fork at 
https://github.com/bauplanlabs/iceberg-rust, but only so that we can pick up 
patches before they are merged upstream. The other fork I've seen used is 
[RisingWave's](https://github.com/risingwavelabs/iceberg-rust), which I think 
is in a similar position (although theirs has diverged further).
   
   We use `iceberg-rust` in production in concert with DataFusion, but we don't 
use the `IcebergTableProvider` directly, even though we would really like to. 
Instead, we use `iceberg-rust` just for fetching/pruning the manifest lists and 
then use DataFusion directly. This is awkward and error-prone, and we'd really 
like to avoid a hack like that. I think there are three particularly 
low-hanging fruit that would really make using `IcebergTableProvider` feasible 
for us:
   
    - **Fixing the deadlock(s) in the read path:** in particular this PR seems 
excellent and it has gotten zero attention: 
https://github.com/apache/iceberg-rust/pull/1486
    - **Output partitioning**: as I raised in Slack, reading from 
`IcebergTableProvider` limits you to a single thread, which is obviously not 
very useful. Here are some 
[benchmarks](https://github.com/colinmarc/iceberg-datafusion-benchmarks) I 
created to demonstrate the issue. [This 
issue](https://github.com/apache/iceberg-rust/issues/1604) was closed as not 
planned. I don't understand the issue 100%, so I hope I'm not misconstruing 
anything.
    - **Taking advantage of DataFusion's parquet optimizations**: unless I 
missed something, `iceberg-rust` doesn't use `DataSourceExec`/`ParquetSource`, 
which means we would automatically lose out on a lot of parquet optimizations 
already landed or being landed in DataFusion, like metadata caching. I don't 
understand if that's intended or not. Again, it's possible I'm 
misunderstanding, apologies if so.
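
   To make the output partitioning point concrete, here is a minimal sketch of the kind of file grouping I have in mind: split the planned data files into several groups so that each group can be read as its own output partition. This is only an illustration using the standard library. `PlannedFile` and `group_files` are hypothetical names, not `iceberg-rust` or DataFusion APIs, and the greedy size balancing is an assumption on my part rather than what either project actually does.

```rust
/// Hypothetical stand-in for a planned data file; NOT an iceberg-rust type.
#[derive(Debug, Clone, PartialEq)]
pub struct PlannedFile {
    pub path: String,
    pub size_bytes: u64,
}

/// Distribute `files` across at most `target_partitions` groups.
///
/// Files are visited from largest to smallest, and each one is added to the
/// group that currently holds the fewest bytes (greedy balancing). Every
/// group could then back one output partition of a scan.
pub fn group_files(
    mut files: Vec<PlannedFile>,
    target_partitions: usize,
) -> Vec<Vec<PlannedFile>> {
    // At least one group, and never more groups than files.
    let n = target_partitions.max(1).min(files.len().max(1));
    let mut groups: Vec<Vec<PlannedFile>> = vec![Vec::new(); n];
    let mut totals = vec![0u64; n];

    // Largest files first, so the big ones get spread out early.
    files.sort_by(|a, b| b.size_bytes.cmp(&a.size_bytes));

    for file in files {
        // Index of the group with the smallest running byte total.
        let (idx, _) = totals
            .iter()
            .enumerate()
            .min_by_key(|(_, total)| **total)
            .expect("there is always at least one group");
        totals[idx] += file.size_bytes;
        groups[idx].push(file);
    }
    groups
}
```

   With something like this, a scan over many data files could report as many output partitions as there are groups instead of one, which would let DataFusion read the groups in parallel.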
   
   We're more than willing to contribute fixes and features, and we have 
already, but the ones above are pretty intimidating for me to tackle without 
any help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

