Hi devs,

I need help figuring out whether and how adopting Iceberg fits into a generic data-lake architecture. The use-case is very broad, so please excuse the abrupt and naive attempt to summarize it in a short email; I'll start with a rundown of the general use-case and try to narrow it down by the time I get to specific questions about Iceberg support...
A generic data-lake architecture usually involves (at least) two stages for landing data before making it accessible for querying (terminology varies a lot: zones, raw vs. processed stores, ingestion tier and insights tier, etc.). Data typically undergoes a particular set of transformations across these stages, either successfully advancing to the next stage or forfeiting the promotion process; in either case there's a metadata operation recording the status. When such a transformation succeeds, data is generally promoted to the next stage via a data move operation or a metadata operation, depending on the underlying file system implementation. Either way, it amounts to a file path change.

Adopting Iceberg as a data writer in any of the earlier stages would imply promoting Iceberg table changes along with the promotion of the actual data files, so that consumers can eventually rely on the Iceberg format. Taking the naive approach of having corresponding Iceberg tables across the various stages, I was wondering whether there's any support for promoting commits from one Iceberg table to another just by "tweaking" the data file paths, as a metadata operation alone. Is that achievable with Iceberg today? This relates to my earlier question about extending the API based on Ryan's PR [1]; for this particular use-case, supporting only append-files commits would suffice. I've put a rough sketch of what I have in mind in the P.S. below.

As a side note (but probably with considerable implications for the topic at hand): after reading up on the "Updates/Deletes/Upserts in Iceberg" proposal and trying to reason about the implications of implementing it, I got the feeling that file paths become entirely an Iceberg concern, totally opaque to the consumer on both the write and read paths. I also believe that data compaction could no longer be an external process; it would have to understand Iceberg data file semantics. These two implications would have a considerable impact on adopting Iceberg in a generic data-lake architecture. Are these all false impressions/assumptions I'm making here?

*Question*: Should Iceberg concern itself with supporting such use-cases, to accommodate embedding it in a generic data-lake architecture in the first place, thinking solely from an adoption point of view?

If anyone else has given this some thought and has either figured some things out or wants to share ideas on the topic, please do.

[1] https://github.com/apache/incubator-iceberg/pull/201

--
/Filip
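P.S. To make the "promote by tweaking paths" idea concrete, here's a minimal sketch of what I'd want to express against the Java API, assuming the physical files are either readable from both stage locations or relocated by a separate process. promotePath is a hypothetical helper standing in for whatever path mapping the promotion step applies; everything else is the existing API as I understand it:

import java.io.IOException;

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class PromoteAppends {

  // Hypothetical mapping from a raw-zone path to its processed-zone path.
  static String promotePath(String rawPath) {
    return rawPath.replaceFirst("/raw/", "/processed/");
  }

  // Re-commit the current data files of rawTable into processedTable,
  // rewriting only their paths. No data files are read or rewritten; the
  // promotion is a single atomic append commit on the target table.
  static void promote(Table rawTable, Table processedTable) throws IOException {
    AppendFiles append = processedTable.newAppend();
    try (CloseableIterable<FileScanTask> tasks = rawTable.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        DataFile promoted = DataFiles.builder(processedTable.spec())
            .copy(file)                                     // keep partition, metrics, counts
            .withPath(promotePath(file.path().toString()))  // only the location changes
            .build();
        append.appendFile(promoted);
      }
    }
    append.commit();
  }
}

Note this only snapshots the source table's current state rather than replaying its individual commits, which is why append-files support alone would be enough for this use-case.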