colinmarc commented on issue #1797: URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3458317766
Hi, thanks for raising this.

We maintain a fork at https://github.com/bauplanlabs/iceberg-rust, but only to stay ahead of patches being merged upstream. The other fork I've seen used is [RisingWave's](https://github.com/risingwavelabs/iceberg-rust), which I think is in a similar position (although theirs has diverged more).

We use iceberg-rust in production together with DataFusion, but we don't use the `IcebergTableProvider` directly, even though we would really like to. Instead, we use `iceberg-rust` just for fetching/pruning the manifest lists and then use DataFusion directly. This is awkward and error-prone, and we'd really like to avoid a hack like that.

I think there are three particularly low-hanging fruit that would make using `IcebergTableProvider` feasible for us:

- **Fixing the deadlock(s) in the read path:** in particular, this PR seems excellent and has gotten zero attention: https://github.com/apache/iceberg-rust/pull/1486
- **Output partitioning:** as I raised in Slack, reading from `IcebergTableProvider` limits you to a single thread, which is obviously not very useful. Here are some [benchmarks](https://github.com/colinmarc/iceberg-datafusion-benchmarks) I created to demonstrate the issue. [This issue](https://github.com/apache/iceberg-rust/issues/1604) was closed as not planned. I don't understand the issue 100%, so I hope I'm not misconstruing anything.
- **Taking advantage of DataFusion's parquet optimizations:** unless I missed something, `iceberg-rust` doesn't use `DataSourceExec`/`ParquetSource`, which means we would automatically lose out on a lot of parquet optimizations that have landed or are landing in DataFusion, like metadata caching. I don't know whether that's intended. Again, it's possible I'm misunderstanding; apologies if so.

We're more than willing to contribute fixes and features, and we already have, but the ones above are pretty intimidating for me to tackle without any help.
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
