liurenjie1024 commented on issue #1797:
URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3468306274

   > Our ideal state is not using the IcebergTableProvider that is provided by 
this project out of the box. We went through a similar exercise with delta-rs 
and found the maintenance burden of coordinating DataFusion versions to 
delta-rs versions with our usage of DataFusion to be very difficult. We really 
like the approach that the 
[delta-kernel-rs](https://github.com/delta-io/delta-kernel-rs) team took with 
providing a good set of primitives that can be used during planning, which we 
then use to [hook into the advanced Parquet reading 
capabilities](https://github.com/spiceai/spiceai/blob/trunk/crates/data_components/src/delta_lake.rs#L357)
 that DataFusion has (i.e. ParquetExec, ParquetAccessPlan, object_store, etc).
   > 
   > So our wishlist would be:
   > 
   > * A "kernel" (similar to what delta-kernel does) that separates the 
planning from execution and makes it easy to integrate into a custom query 
engine.
   
   In fact, iceberg-rust is already organized in a similar way. This repo 
contains several crates, which can be categorized as follows:
   
   * iceberg: Similar to iceberg-core in Java, this crate provides 
engine-independent building blocks such as the planning API, the transaction 
API, and data file readers/writers.
   * iceberg-catalog-*: Each of these crates is a concrete catalog 
implementation, split out so that users don't need to include the dependencies 
of every catalog.
   * integrations: These crates provide integrations with different engines. 
Currently the main focus is DataFusion, due to its extensibility.
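
To illustrate the split between engine-independent planning and engine-specific execution described above, here is a minimal, self-contained sketch. All names in it (`DataFileMeta`, `PlannedFile`, `plan_files`, `FileReader`, `CountingReader`, `execute`) are hypothetical and chosen for illustration only; this is not the actual `iceberg` crate API, just the general shape of the idea.

```rust
// Hypothetical types for illustration only; NOT the iceberg-rust API.

/// File-level metadata, roughly what a table format records per data file.
pub struct DataFileMeta {
    pub path: String,
    pub record_count: u64,
    /// Upper bound of the `id` column in this file (assumed statistic).
    pub max_id: i64,
}

/// A unit of work produced by planning and handed to the engine.
#[derive(Debug, Clone, PartialEq)]
pub struct PlannedFile {
    pub path: String,
    pub record_count: u64,
}

/// Planning step: engine independent. Keeps only the files that may contain
/// rows matching the predicate `id >= lower`, based on the max statistic.
pub fn plan_files(files: &[DataFileMeta], lower: i64) -> Vec<PlannedFile> {
    files
        .iter()
        .filter(|f| f.max_id >= lower)
        .map(|f| PlannedFile {
            path: f.path.clone(),
            record_count: f.record_count,
        })
        .collect()
}

/// Execution step: each engine supplies its own reader implementation.
pub trait FileReader {
    /// Reads one planned file and returns the number of rows produced.
    fn read(&self, task: &PlannedFile) -> u64;
}

/// Stand-in reader that does no IO and just reports the planned row count.
pub struct CountingReader;

impl FileReader for CountingReader {
    fn read(&self, task: &PlannedFile) -> u64 {
        task.record_count
    }
}

/// Drives any reader over the planned files and sums the rows it reports.
pub fn execute(tasks: &[PlannedFile], reader: &dyn FileReader) -> u64 {
    tasks.iter().map(|t| reader.read(t)).sum()
}
```

In this sketch a custom engine would only swap out the `FileReader` implementation (for example, one backed by its own Parquet reader), while `plan_files` stays unchanged.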
   
   > * Allow using object_store for the kernel IO (ref: 
https://github.com/apache/iceberg-rust/issues/172) instead of OpenDAL, since we 
are already heavily invested in it.
   
   There is already an ongoing effort for this part, see 
https://github.com/apache/iceberg-rust/pull/1755 (thanks @CTTY).
   
   > * A "reference" implementation of using the kernel (i.e. it could be 
IcebergTableProvider, but maybe just an example) that shows how to separate the 
planning of which files to read (and which rows to mask) with a deep 
integration into the DataFusion ParquetExec machinery. I think its fine to 
leave the IcebergTableProvider as a "batteries-included" provider that does 
everything using OpenDAL, as long as we had the primitives above.
   
   The DataFusion integration, e.g. `IcebergTableProvider`, could be used for 
this purpose.
   
   

