camuel commented on issue #1797: URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3459125897
I think the Rust ecosystem needs something like DuckLake, maybe FusionLake? Last time I looked into IcebergTableProvider, it looked like building a mini query engine or even a mini DBMS within the IcebergTableProvider itself. Is this smart? Those manifest lists, manifests, Puffin files, and JSON files are a lot of transactionally mutable data. Is it really smart to work with them as just plain files? Especially by DBMS folks? Populate those Rust collections, do something, then discard everything on every query? Or develop some sort of ORM cache just for Iceberg metadata? Then there is all that advanced functionality like branching operations, which only works with Spark. Does Rust need to implement it? This is the real complexity!

Given that DataFusion can already natively query Avro and JSON, perhaps DataFusion itself can be used to produce a scan plan with advanced functionality, with most of the logic encoded in SQL or in a hand-crafted logical plan and not directly in Rust? This would be in line with the "canonical Iceberg implementation". But going further, along the lines of DuckLake (and also the Snowflake and Databricks implementations), perhaps DataFusion can use SQLite, Turso, or any other OLTP DBMS to maintain its metadata, Iceberg or non-Iceberg, then efficiently dump it to and ingest it from canonical Iceberg metadata files, and perhaps even provide IRC out of the box (with the help of Lakekeeper, for example) plus federation with other IRCs. This would be a full Iceberg implementation on par with the "Snowbricks duo".
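To make the last idea concrete, here is a minimal sketch of keeping table metadata in SQLite and exporting a snapshot to a JSON file only on demand. Everything in it is an assumption for illustration: the two-table schema, the function names (`open_metastore`, `commit_append`, `export_snapshot`), and the exported JSON layout are hypothetical and are not the Iceberg metadata format or any existing DataFusion or iceberg-rust API. It also assumes a linear, append-only history, so a snapshot sees every file committed at or before it.

```python
import json
import sqlite3


def open_metastore() -> sqlite3.Connection:
    """Create an in-memory SQLite metastore with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE snapshots (
            snapshot_id INTEGER PRIMARY KEY,
            parent_id   INTEGER
        );
        CREATE TABLE data_files (
            snapshot_id  INTEGER NOT NULL REFERENCES snapshots(snapshot_id),
            file_path    TEXT NOT NULL,
            record_count INTEGER NOT NULL
        );
        """
    )
    return conn


def commit_append(conn: sqlite3.Connection, files: list[tuple[str, int]]) -> int:
    """Record one append as a single SQLite transaction; return the new snapshot id."""
    with conn:  # commits on success, rolls back if an exception is raised
        parent = conn.execute("SELECT MAX(snapshot_id) FROM snapshots").fetchone()[0]
        cur = conn.execute("INSERT INTO snapshots (parent_id) VALUES (?)", (parent,))
        snapshot_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO data_files VALUES (?, ?, ?)",
            [(snapshot_id, path, count) for path, count in files],
        )
    return snapshot_id


def export_snapshot(conn: sqlite3.Connection, snapshot_id: int) -> str:
    """Dump the files visible at a snapshot as JSON (made-up layout, not the Iceberg spec)."""
    rows = conn.execute(
        "SELECT file_path, record_count FROM data_files "
        "WHERE snapshot_id <= ? ORDER BY snapshot_id, file_path",
        (snapshot_id,),
    ).fetchall()
    doc = {
        "snapshot-id": snapshot_id,
        "data-files": [{"file-path": p, "record-count": n} for p, n in rows],
        "total-records": sum(n for _, n in rows),
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    conn = open_metastore()
    commit_append(conn, [("s3://bucket/t/a.parquet", 100)])
    latest = commit_append(conn, [("s3://bucket/t/b.parquet", 50)])
    print(export_snapshot(conn, latest))
```

Commits here touch only the database; the export step runs only when another engine actually needs to read the table, which is the split between the two uses of Iceberg described next.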
The thing with Iceberg is to separate "Iceberg as a data interchange format", where each transaction needs to generate those metadata files on S3 as part of each commit, from "Iceberg as an internal data format", where for most or all transactions no other engine needs to access the data. In the latter case it is wasteful to generate and regenerate all those metadata files and then run housekeeping to remove them when the only engine accessing the data uses its own metastore.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
