liurenjie1024 commented on issue #1797: URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3468306274
> Our ideal state is not using the IcebergTableProvider that is provided by this project out of the box. We went through a similar exercise with delta-rs and found the maintenance burden of coordinating DataFusion versions to delta-rs versions with our usage of DataFusion to be very difficult. We really like the approach that the [delta-kernel-rs](https://github.com/delta-io/delta-kernel-rs) team took with providing a good set of primitives that can be used during planning, which we then use to [hook into the advanced Parquet reading capabilities](https://github.com/spiceai/spiceai/blob/trunk/crates/data_components/src/delta_lake.rs#L357) that DataFusion has (i.e. ParquetExec, ParquetAccessPlan, object_store, etc).
>
> So our wishlist would be:
>
> * A "kernel" (similar to what delta-kernel does) that separates the planning from execution and makes it easy to integrate into a custom query engine.

In fact, iceberg-rust is organized in a similar way. This repo contains several crates, which can be categorized as follows:

* iceberg: This is similar to iceberg-core in Java. It provides many compute-engine-independent building blocks, such as the planning API, the transaction API, and data file readers/writers.
* iceberg-catalog-*: These crates are concrete catalog implementations, kept separate so that users don't need to include all dependencies.
* integrations: These crates provide integrations with different engines. Currently the main focus is DataFusion due to its extensibility.

> * Allow using object_store for the kernel IO (ref: https://github.com/apache/iceberg-rust/issues/172) instead of OpenDAL, since we are already heavily invested in it.

There is already an ongoing effort for this part, see https://github.com/apache/iceberg-rust/pull/1755 (thanks @CTTY).

> * A "reference" implementation of using the kernel (i.e. it could be IcebergTableProvider, but maybe just an example) that shows how to separate the planning of which files to read (and which rows to mask) with a deep integration into the DataFusion ParquetExec machinery. I think it's fine to leave the IcebergTableProvider as a "batteries-included" provider that does everything using OpenDAL, as long as we had the primitives above.

The DataFusion integration, e.g. `IcebergTableProvider`, could be used for this purpose.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
