Hi devs,

I need help figuring out if and how adopting Iceberg fits into a generic
data-lake architecture. The use case is very broad, so please excuse the
abrupt and naive attempt to summarize it in a short email; I'll start
with a rundown of the general use case and try to narrow it down before
asking specific questions about Iceberg support...

A generic data-lake architecture generally involves (at least) two
stages for landing data before making it accessible for querying (the
terminology varies a lot: zones, raw vs. processed stores, ingestion
tier vs. insights tier, etc.).
Data usually undergoes a particular set of transformations across these
stages, either successfully advancing to the next stage or forfeiting
the promotion; in either case a metadata operation records the status.
When such a transformation succeeds, data is promoted to the next stage
via a data-move operation or a metadata operation, depending on the
underlying file system implementation; either way, it's a file path
change.
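To make the promotion model above concrete, here is a minimal sketch of
that flow. Everything here is hypothetical and illustrative (the
prefixes, the promote function, the status log); it is not Iceberg code,
just the "path change plus status record" pattern described above:

```python
# Illustrative stand-ins for a two-stage lake layout; the prefixes
# and function names are assumptions, not real APIs.
RAW_PREFIX = "s3://lake/raw/"
PROCESSED_PREFIX = "s3://lake/processed/"

def promote(path: str, status_log: list) -> str:
    """Promote a data file from the raw stage to the processed stage.

    On a store with cheap renames this is a metadata operation; on an
    object store it may be a copy. Either way, the file path changes
    and the outcome is recorded as a metadata operation.
    """
    if not path.startswith(RAW_PREFIX):
        # The file forfeits promotion; record the failure.
        status_log.append(("rejected", path))
        return path
    new_path = PROCESSED_PREFIX + path[len(RAW_PREFIX):]
    status_log.append(("promoted", new_path))
    return new_path

log = []
promoted = promote("s3://lake/raw/events/part-00000.parquet", log)
```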

Adopting Iceberg as a data writer in any of the earlier stages would
imply promoting Iceberg table changes along with the actual data files,
so that consumers can eventually rely on the Iceberg format.

Taking the naive approach of having corresponding Iceberg tables across
the various stages, I was wondering whether there is any support for
promoting commits across two Iceberg tables purely as a metadata
operation, by "tweaking" the file paths of the data files. Is that
achievable with Iceberg today? This relates to my earlier question about
extending the API based on Ryan's PR [1]; for this particular use case,
supporting only append-files commits would suffice.
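For the avoidance of doubt, here is a toy sketch of what I mean by
"promoting a commit" as a metadata-only operation. The table structures
and the promote_append function are made-up stand-ins for this email,
not the Iceberg API; the point is only that the destination table
re-registers the same files under rewritten paths, without touching
file contents:

```python
# Hypothetical model: a "table" is a dict holding a list of snapshots,
# each snapshot a list of data-file entries. Not Iceberg's real layout.
def promote_append(src_table: dict, dst_table: dict, rewrite) -> None:
    """Re-register the data files of src_table's latest snapshot in
    dst_table, rewriting only their paths (a metadata-only append)."""
    src_files = src_table["snapshots"][-1]["files"]
    promoted = [{**f, "path": rewrite(f["path"])} for f in src_files]
    dst_table["snapshots"].append({"files": promoted})

raw = {"snapshots": [{"files": [
    {"path": "s3://lake/raw/t/part-0.parquet", "records": 100},
]}]}
curated = {"snapshots": []}

promote_append(raw, curated,
               lambda p: p.replace("s3://lake/raw/", "s3://lake/curated/"))
```

In Iceberg terms the question is whether the equivalent of this — taking
an append commit from one table and replaying it against another with
only the paths changed — is (or could be) supported.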

As a side note (but probably with considerable implications for the
topic at hand): after reading up on "Updates/Deletes/Upserts in Iceberg"
and trying to reason about the implications of implementing it, I got
the feeling that file paths become entirely an Iceberg concern, totally
opaque to the consumer on both the write and read paths. I also believe
that data compaction could no longer be an external process; it would
have to understand Iceberg data-file semantics. These two implications
would considerably impact adopting Iceberg in a generic data-lake
architecture. Are these false impressions/assumptions on my part?

*Question*: Should Iceberg concern itself with supporting such use cases
to accommodate embedding it in a generic data-lake architecture in the
first place, thinking purely from an adoption point of view?

If anyone else has given this some thought and has either figured some
of it out or wants to share ideas on the topic, please do.

[1] https://github.com/apache/incubator-iceberg/pull/201

-- 
/Filip
