Re: [Discuss] Global Snapshot Consistency for Iceberg Tables

Xiening Dai Fri, 19 Jun 2026 09:23:12 -0700

Hi Andrei, thanks for your reply. Let me address some of your questions here,
and we can discuss more in the community sync.


> it only works if we go catalog-as-source-of-truth.

I think we are already using catalog-as-source-of-truth. Today we rely on it to
provide the latest snapshot, as well as to detect and resolve write conflicts.
Without the catalog, we don't even have single-table SI.

> It breaks for federated/sharded/multi-region catalogs, or anywhere part of 
> the state
>    (e.g. data commits) lives in a different service

Yes, I agree with this concern. I think the fundamental question is what the
boundary of SI will be. Maybe we should limit it to a single namespace, since
there's no concept of a database? Once we define this at the conceptual level,
the catalog implementation can adapt to it - for example, metadata for the same
namespace always lives in the same partition.
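
To make the namespace boundary concrete, here is a minimal Python sketch of how
a catalog might validate a batch read against it. Everything in it is
hypothetical and for illustration only: `check_si_boundary`,
`CrossNamespaceError`, and the (namespace, table) identifier shape are my
assumptions, not part of the IRC spec or the proposal doc.

```python
from collections import defaultdict


class CrossNamespaceError(ValueError):
    """Raised when a batch load does not fall inside exactly one namespace."""


def group_by_namespace(table_idents):
    """Group (namespace, table) identifiers by namespace.

    A namespace is modeled as a tuple of strings so it can be multi-level
    and still be used as a dict key (an assumption for this sketch).
    """
    groups = defaultdict(list)
    for namespace, table in table_idents:
        groups[namespace].append(table)
    return dict(groups)


def check_si_boundary(table_idents):
    """Return the single namespace the batch belongs to, or raise.

    If SI is scoped to a namespace, a LoadTables batch that spans
    namespaces (or is empty) cannot be served as one consistent read.
    """
    groups = group_by_namespace(table_idents)
    if len(groups) != 1:
        raise CrossNamespaceError(
            f"batch must target exactly one namespace, got {len(groups)}"
        )
    return next(iter(groups))
```

A catalog that partitions its metadata by namespace could route the whole batch
to the partition for the returned namespace and serve it with one SI read.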

> That needs a single transactional store behind the catalog. 

Yes, it needs to be a system of record. And I'd argue that this should always
have been the case even without this proposal, as the catalog is used to
guarantee table update consistency. But it doesn't have to be "single". As I
mentioned, if we define the transaction boundary within a namespace, we could
have one transactional store per namespace.

> Without it, "implementations must honor atomicity" quietly turns
> into sequential reads, which is the failure the doc already calls out.

Here I am only proposing the IRC spec. When it comes down to engineering, yes,
we need tests and a reference implementation that actually enforce such
semantics.
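
The conformance test you describe (a writer doing atomic multi-table commits, a
reader loop checking it never sees a partial commit) could look roughly like
the Python sketch below. `ToyCatalog` is an in-memory stand-in that uses a lock
in place of a transactional store; it is not Polaris or any real catalog, and
the method names `commit_tables` and `load_tables` are hypothetical.

```python
import threading


class ToyCatalog:
    """In-memory stand-in: one lock plays the role of the system of record."""

    def __init__(self, tables):
        self._lock = threading.Lock()
        self._snapshots = {t: 0 for t in tables}

    def commit_tables(self, updates):
        # Atomic multi-table commit: all pointers move under one lock.
        with self._lock:
            self._snapshots.update(updates)

    def load_tables(self, tables):
        # Batch read at a single point in the commit order.
        with self._lock:
            return {t: self._snapshots[t] for t in tables}


def run_conformance_check(catalog, tables, commits=2000):
    """Return every batch read that observed a partial commit."""
    done = threading.Event()
    partial_reads = []

    def writer():
        # Each commit moves every table to the same snapshot number n.
        for n in range(1, commits + 1):
            catalog.commit_tables({t: n for t in tables})
        done.set()

    def reader():
        while True:
            finished = done.is_set()
            seen = catalog.load_tables(tables)
            # A consistent read shows the same snapshot number everywhere.
            if len(set(seen.values())) != 1:
                partial_reads.append(seen)
            if finished:
                break

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial_reads
```

Pointing the same reader loop at sequential single-table loads instead of
`load_tables` is the case the test is meant to catch, since nothing then stops
a read from landing between two table updates.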

On 2026/06/17 22:43:38 Andrei Tserakhau via dev wrote:
> Thanks for writing this up. +1 on bringing it to the sync.
> 
> A few points.
> 
>    1.
> 
>    The real question isn't the read API. It's where the order lives. If the
>    commit order (CSN/timestamp) is catalog-only -- returned by the catalog,
>    not written into object-storage metadata -- then the catalog is the only
>    thing that decides visibility, and consistent reads can be enforced. If
>    it's persisted into metadata.json, any reader can resolve metadata straight
>    from storage and skip the catalog. Then LoadTables/CSN only binds readers
>    that go through the catalog. That's cooperative, not enforceable. It works
>    for readers that play nice, not the rest. This is the same decision as the
>    v4 idea of dropping the root manifest and moving the snapshot root into the
>    catalog. Consistency is only enforceable once the catalog is the only thing
>    that resolves the current snapshot. So the "persist CSN in metadata.json vs
>    return it only in LoadTable" question and the v4 root-manifest direction
>    are one decision, and it only works if we go catalog-as-source-of-truth.
>    2.
> 
>    Batch LoadTables on a single SI read assumes one system of record. Fine
>    for Polaris-on-Postgres or single-region DynamoDB. It breaks for
>    federated/sharded/multi-region catalogs, or anywhere part of the state
>    (e.g. data commits) lives in a different service than the table metadata.
>    One SI read won't give you a consistent cross-store snapshot there. You
>    need explicit coordination -- paired pins, or a short quiescence. Either
>    scope that out, or say what the guarantee is when it doesn't hold.
>    3.
> 
>    The whole feature rests on the catalog doing atomic multi-table
>    operations -- the multi-table commit on the write side, and the consistent
>    multi-table read. Reads can't be consistent if the writes weren't atomic,
>    which the doc already calls out as a prerequisite. That needs a single
>    transactional store behind the catalog. Not every catalog has one: a
>    file/Hadoop catalog is just pointer files in storage, Hive's commit path is
>    per-table, federated/multi-region catalogs span stores. Those can't provide
>    global SI at all. So it can't be a blanket spec guarantee -- it has to be a
>    capability the catalog advertises (the CAT/CSN mode in the doc already does
>    this). And the spec has to define what a client does when it's not
>    advertised: hard error, or fall back to per-table reads. Leaving it
>    undefined is the "syntactic sugar over sequential commits" trap the doc
>    warns about.
> 
> This primitive is good for more than joins. Replication and change feeds
> need the same thing -- a consistent snapshot at order N that you can ship
> or re-apply elsewhere, in order. The doc already says the catalog order is
> reusable for commit reports / event ordering; cross-region replication and
> CDC are the same need. Worth keeping in mind so the primitive is generic
> enough for it.
> 
> One concrete thing: pin down LoadTables as "equivalent to a single point in
> the catalog commit order," and add a conformance test -- a writer doing
> atomic multi-table commits, a reader loop checking it never sees a partial
> commit. Without it, "implementations must honor atomicity" quietly turns
> into sequential reads, which is the failure the doc already calls out.
> 
> Andrei
> 
> On Wed, Jun 17, 2026 at 11:18 AM Russell Spitzer <[email protected]>
> wrote:
> 
> > Always welcome, please do
> >
> > On Wed, Jun 17, 2026 at 12:55 PM Xiening Dai <[email protected]> wrote:
> >
> >> Is it ok that I put this topic into the community sync so we can get some
> >> tractions on this issue?
> >>
> >> On 2026/06/05 23:16:04 Xiening Dai wrote:
> >> > And I replied your comments in the doc. Thank you.
> >> >
> >> > On 2026/06/04 23:35:04 Maninder Parmar wrote:
> >> > > Hi Xiening,
> >> > > The LoadTables proposal above seems to address the problem of atomically
> >> > > reading the metadata.json across multiple tables "as of" a consistent time,
> >> > > the CSN proposal provides a detailed
> >> > > <https://docs.google.com/document/d/1KVgUJc1WgftHfLz118vMbEE7HV8_pUDk4s-GJFDyAOE/edit?tab=t.0#bookmark=id.ue33k3ujfi7s>
> >> > > explanation of how to achieve it.
> >> > > It does not require reading metadata.json N times for the single table or
> >> > > pinning the catalog state (I have added comments and provided links to
> >> > > relevant sections). Also, there is no need to rewrite the artifacts
> >> > > (manifest/manifest lists) stored in cloud storage as the CSN lives only in
> >> > > the TableMetadata which is written only by the catalog for the REST
> >> > > catalogs.
> >> > >
> >> > > The rest of the proposal aligns closely with the CSN proposal described here
> >> > > <https://docs.google.com/document/d/1KVgUJc1WgftHfLz118vMbEE7HV8_pUDk4s-GJFDyAOE/edit?tab=t.0#heading=h.nwyigim62nez>.
> >> > >
> >> > > Thanks,
> >> > > Maninder
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Jun 3, 2026 at 8:59 AM Xiening Dai <[email protected]> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > Today, the Iceberg spec has table properties defining the transaction
> >> > > > isolation levels: write.delete/update/merge.isolation-level. These
> >> > > > properties can be set to either `snapshot` or `serializable`. With a
> >> > > > properly designed writer and Iceberg multi version snapshots, we can
> >> > > > achieve single table snapshot isolation or even serializable isolation.
> >> > > >
> >> > > > But for queries involving multiple tables, the spec does not provide a
> >> > > > mechanism to achieve a global snapshot consistency. The Iceberg REST
> >> > > > Catalog (IRC) API provides only single-table load operation: LoadTable, and
> >> > > > clients would need to call this API multiple times to resolve table
> >> > > > metadata in a single query statement - each could represent a different
> >> > > > snapshot view of the catalog.
> >> > > >
> >> > > > This creates problem especially for engines that already support global
> >> > > > SI. For example, the transaction semantics for AWS Redshift when query its
> >> > > > native tables is different than querying against Iceberg tables, which
> >> > > > surprises customers at times.
> >> > > >
> >> > > > There were proposals in the past in the context of multi-statement
> >> > > > transaction discussion (
> >> > > > https://docs.google.com/document/d/1jr4Ah8oceOmo6fwxG_0II4vKDUHUKScb/edit#heading=h.qb9z621zr507
> >> > > > ).
> >> > > > But I feel these proposals are too complicated and require significant
> >> > > > changes to the catalog/IRC protocol.
> >> > > >
> >> > > > Here I propose a simpler approach: add a batch LoadTables API, and rely on
> >> > > > the catalog's underlying system-of-record to provide snapshot isolation for
> >> > > > that batch read.
> >> > > >
> >> > > > When a client calls LoadTables({table_a, table_b, table_c}), the catalog
> >> > > > reads the current metadata for all requested tables in a single consistent
> >> > > > operation (e.g., a TransactGetItems in DynamoDB, or a single SI read in a
> >> > > > relational DB). The client receives a consistent cross-table snapshot - the
> >> > > > latest committed state of all requested tables as of a single point in time.
> >> > > >
> >> > > > This would give us the statement level global snapshot consistency. It
> >> > > > doesn’t provide full transaction level SI consistency for multi statement
> >> > > > transactions, but I believe it’s a reasonable trade off.
> >> > > >
> >> > > > I capture the details of this proposal in this doc -
> >> > > > https://docs.google.com/document/d/1u11b4pzeFUKD0XX--nHPj-DoYcNeCgOe94WKCaX2XMI/edit?usp=sharing
> >> > > >
> >> > > > I also created a prototype that implements the LoadTables API for Apache
> >> > > > Polaris, levering the underlying Postgres for the snapshot isolation -
> >> > > > https://github.com/xndai/polaris/commit/f4eb514a2920effe67ecfb8c64e2e3fa418baf11
> >> > > >
> >> > > > Feedbacks and comments are welcomed!
> >> > > >
> >> > >
> >> >
> >>
> >
>
