camuel commented on issue #1797:
URL: https://github.com/apache/iceberg-rust/issues/1797#issuecomment-3459125897

   I think the Rust ecosystem needs something like DuckLake, maybe FusionLake?

Last time I looked into IcebergTableProvider, it looked like building a
mini query engine or even a mini-DBMS within IcebergTableProvider itself. Is
this smart? Those manifest lists, manifests, Puffin files, and JSON files are
a lot of transactionally mutable data. Is it really smart to work with them as
just plain files, especially for DBMS folks? Populate those Rust collections,
do something, then discard everything on each query? Or develop some sort of
ORM cache just for Iceberg metadata?

Then there is all that advanced functionality like branching operations, which
only works with Spark. Does Rust need to implement it? This is the real
complexity! Given that DataFusion can already natively query Avro and JSON,
perhaps DataFusion itself can be used to produce a scan plan with advanced
functionality, with most logic encoded in SQL or in a hand-crafted logical
plan and not directly in Rust? This would be in line with the "canonical
Iceberg implementation".

But going further, along the lines of DuckLake (and also the Snowflake and
Databricks implementations), perhaps DataFusion can use SQLite, Turso, or any
other OLTP DBMS to maintain its metadata, Iceberg or non-Iceberg, and then
efficiently dump/ingest it to/from canonical Iceberg metadata files, and
perhaps even provide an Iceberg REST Catalog (IRC) out of the box (with the
help of Lakekeeper, for example) and federation with other IRCs. This would be
a full Iceberg implementation on par with the "Snowbricks duo".

The thing with Iceberg is to separate "Iceberg as a data interchange format",
where each transaction needs to generate those metadata files on S3 as part of
each commit, from "Iceberg as an internal data format", where for most or all
transactions no other engine needs to access the data. In that second case it
is wasteful to generate and regenerate all those numerous metadata files and
then run housekeeping to remove them, when the only engine accessing the data
uses its own metastore.
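
To make the "metadata in SQLite or another OLTP DBMS" idea a bit more concrete, here is a minimal sketch in Python using only the standard-library sqlite3 module. Everything in it is hypothetical and invented for illustration: the `snapshots` and `data_files` tables, their columns, and the `open_metastore`, `commit`, and `plan_scan` helpers are not DuckLake's or Iceberg's actual schema or API. It also assumes snapshot IDs simply increase over time on a single branch, which is a simplification.

```python
import sqlite3

# Hypothetical metastore schema, invented for this sketch only.
# Assumption: snapshot IDs increase monotonically on a single branch.
# Real Iceberg tracks lineage through parent snapshot pointers instead.
SCHEMA = """
CREATE TABLE snapshots (
    snapshot_id INTEGER PRIMARY KEY
);
CREATE TABLE data_files (
    path             TEXT PRIMARY KEY,
    added_snapshot   INTEGER NOT NULL REFERENCES snapshots(snapshot_id),
    removed_snapshot INTEGER REFERENCES snapshots(snapshot_id),
    min_ts           INTEGER NOT NULL,  -- per-file column statistics
    max_ts           INTEGER NOT NULL
);
"""


def open_metastore():
    """Create an in-memory metastore with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def commit(conn, snapshot_id, added=(), removed=()):
    """Record one table commit as a single database transaction.

    `added` is an iterable of (path, min_ts, max_ts) tuples and
    `removed` is an iterable of paths. No metadata files are written.
    """
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO snapshots (snapshot_id) VALUES (?)", (snapshot_id,)
        )
        conn.executemany(
            "INSERT INTO data_files (path, added_snapshot, min_ts, max_ts) "
            "VALUES (?, ?, ?, ?)",
            [(path, snapshot_id, lo, hi) for path, lo, hi in added],
        )
        conn.executemany(
            "UPDATE data_files SET removed_snapshot = ? WHERE path = ?",
            [(snapshot_id, path) for path in removed],
        )


def plan_scan(conn, snapshot_id, ts_lo, ts_hi):
    """Return the files a scan must read: live at `snapshot_id` and with
    a [min_ts, max_ts] range overlapping [ts_lo, ts_hi]."""
    rows = conn.execute(
        """
        SELECT path FROM data_files
        WHERE added_snapshot <= ?
          AND (removed_snapshot IS NULL OR removed_snapshot > ?)
          AND max_ts >= ? AND min_ts <= ?
        ORDER BY path
        """,
        (snapshot_id, snapshot_id, ts_lo, ts_hi),
    )
    return [row[0] for row in rows]
```

In this sketch, time travel and file pruning are both just predicates in one SQL query, and a commit is an ordinary transaction. Exporting canonical Iceberg metadata files for other engines would be a separate step that this sketch does not attempt.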


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


