Re: [Discuss] Global Snapshot Consistency for Iceberg Tables

Andrei Tserakhau via dev Wed, 17 Jun 2026 15:44:06 -0700

Thanks for writing this up. +1 on bringing it to the sync.

A few points.

   1. The real question isn't the read API. It's where the order lives. If the
   commit order (CSN/timestamp) is catalog-only -- returned by the catalog,
   not written into object-storage metadata -- then the catalog is the only
   thing that decides visibility, and consistent reads can be enforced. If
   it's persisted into metadata.json, any reader can resolve metadata straight
   from storage and skip the catalog. Then LoadTables/CSN only binds readers
   that go through the catalog. That's cooperative, not enforceable. It works
   for readers that play nice, not the rest. This is the same decision as the
   v4 idea of dropping the root manifest and moving the snapshot root into the
   catalog. Consistency is only enforceable once the catalog is the only thing
   that resolves the current snapshot. So the "persist CSN in metadata.json vs
   return it only in LoadTable" question and the v4 root-manifest direction
   are one decision, and it only works if we go catalog-as-source-of-truth.

   2. Batch LoadTables on a single SI read assumes one system of record. Fine
   for Polaris-on-Postgres or single-region DynamoDB. It breaks for
   federated/sharded/multi-region catalogs, or anywhere part of the state
   (e.g. data commits) lives in a different service than the table metadata.
   One SI read won't give you a consistent cross-store snapshot there. You
   need explicit coordination -- paired pins, or a short quiescence. Either
   scope that out, or say what the guarantee is when it doesn't hold.

   3. The whole feature rests on the catalog doing atomic multi-table
   operations -- the multi-table commit on the write side, and the consistent
   multi-table read. Reads can't be consistent if the writes weren't atomic,
   which the doc already calls out as a prerequisite. That needs a single
   transactional store behind the catalog. Not every catalog has one: a
   file/Hadoop catalog is just pointer files in storage, Hive's commit path is
   per-table, federated/multi-region catalogs span stores. Those can't provide
   global SI at all. So it can't be a blanket spec guarantee -- it has to be a
   capability the catalog advertises (the CAT/CSN mode in the doc already does
   this). And the spec has to define what a client does when it's not
   advertised: hard error, or fall back to per-table reads. Leaving it
   undefined is the "syntactic sugar over sequential commits" trap the doc
   warns about.
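
To make the "not advertised" case concrete, here is a rough client-side
sketch. Everything in it is hypothetical: the capability string, the
FakeCatalog stand-in, and the OnMissing policy names are mine, not from the
doc or the REST spec. The only point is that the client's behavior is an
explicit choice and that the caller is told whether the read was consistent.

```python
from dataclasses import dataclass, field
from enum import Enum

# Hypothetical capability name; the doc's CAT/CSN mode would define the real one.
CAPABILITY = "multi-table-snapshot-read"


class OnMissing(Enum):
    ERROR = "error"          # refuse to run without a consistent read
    PER_TABLE = "per-table"  # degrade to sequential single-table loads


class ConsistentReadUnsupported(Exception):
    """Raised when the catalog cannot serve a consistent multi-table read."""


@dataclass
class FakeCatalog:
    # In-memory stand-in for a catalog client (assumption, not a real API).
    tables: dict
    capabilities: set = field(default_factory=set)

    def load_table(self, name):
        return self.tables[name]

    def load_tables(self, names):
        # A real catalog would answer this from one SI read of its store.
        return {n: self.tables[n] for n in names}


def load_for_statement(catalog, names, on_missing):
    """Return (metadata_by_table, consistent) for one query statement."""
    if CAPABILITY in catalog.capabilities:
        return catalog.load_tables(names), True
    if on_missing is OnMissing.ERROR:
        raise ConsistentReadUnsupported(
            "catalog does not advertise " + CAPABILITY
        )
    # Explicit fallback: the result may mix different catalog states.
    return {n: catalog.load_table(n) for n in names}, False
```

Returning the `consistent` flag keeps the degraded path visible to the
engine instead of silently looking like a consistent read.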

This primitive is good for more than joins. Replication and change feeds
need the same thing -- a consistent snapshot at order N that you can ship
or re-apply elsewhere, in order. The doc already says the catalog order is
reusable for commit reports / event ordering; cross-region replication and
CDC are the same need. Worth keeping in mind so the primitive is generic
enough for it.

One concrete thing: pin down LoadTables as "equivalent to a single point in
the catalog commit order," and add a conformance test -- a writer doing
atomic multi-table commits, a reader loop checking it never sees a partial
commit. Without it, "implementations must honor atomicity" quietly turns
into sequential reads, which is the failure the doc already calls out.
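
A minimal sketch of what that conformance test could look like, run against
an in-memory stand-in. InMemoryCatalog, commit_tables, and
count_partial_reads are hypothetical names for illustration, and a lock
stands in for the catalog's transactional store. A real test would drive an
actual catalog over REST.

```python
import threading


class InMemoryCatalog:
    """Hypothetical catalog backed by one transactional store (a lock here)."""

    def __init__(self, names):
        self._lock = threading.Lock()
        self._versions = {n: 0 for n in names}

    def commit_tables(self, updates):
        # Atomic multi-table commit: all pointers move together.
        with self._lock:
            self._versions.update(updates)

    def load_tables(self, names):
        # Consistent multi-table read: a single point in the commit order.
        with self._lock:
            return {n: self._versions[n] for n in names}

    def load_table(self, name):
        with self._lock:
            return self._versions[name]


def batch_read(catalog, names):
    return catalog.load_tables(names)


def count_partial_reads(read_all, catalog, names, commits=2000):
    """Run a writer and a reader concurrently; count reads that mix versions."""
    done = threading.Event()

    def writer():
        # Every commit sets all tables to the same version number.
        for version in range(1, commits + 1):
            catalog.commit_tables({n: version for n in names})
        done.set()

    thread = threading.Thread(target=writer)
    thread.start()
    partial = 0
    while not done.is_set():
        seen = read_all(catalog, names)
        if len(set(seen.values())) != 1:
            partial += 1  # reader observed part of a commit
    thread.join()
    # One last read after the writer finishes, so at least one read is checked.
    if len(set(read_all(catalog, names).values())) != 1:
        partial += 1
    return partial
```

With batch_read the count should stay at zero. Passing a read_all that calls
load_table once per table is the sequential-read behavior, and under
concurrency it can observe mixed versions. A test like this can only catch
violations, not prove their absence, so it needs enough commits and reader
iterations to be meaningful.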

Andrei

On Wed, Jun 17, 2026 at 11:18 AM Russell Spitzer <[email protected]>
wrote:

> Always welcome, please do
>
> On Wed, Jun 17, 2026 at 12:55 PM Xiening Dai <[email protected]> wrote:
>
>> Is it OK if I put this topic on the community sync agenda so we can get some
>> traction on this issue?
>>
>> On 2026/06/05 23:16:04 Xiening Dai wrote:
>> > And I replied to your comments in the doc. Thank you.
>> >
>> > On 2026/06/04 23:35:04 Maninder Parmar wrote:
>> > > Hi Xiening,
>> > > The LoadTables proposal above seems to address the problem of atomically
>> > > reading the metadata.json across multiple tables "as of" a consistent time;
>> > > the CSN proposal provides a detailed explanation
>> > > <https://docs.google.com/document/d/1KVgUJc1WgftHfLz118vMbEE7HV8_pUDk4s-GJFDyAOE/edit?tab=t.0#bookmark=id.ue33k3ujfi7s>
>> > > of how to achieve it.
>> > > It does not require reading metadata.json N times for a single table or
>> > > pinning the catalog state (I have added comments and provided links to
>> > > relevant sections). Also, there is no need to rewrite the artifacts
>> > > (manifests/manifest lists) stored in cloud storage, as the CSN lives only in
>> > > the TableMetadata, which for REST catalogs is written only by the catalog.
>> > >
>> > > The rest of the proposal aligns closely with the CSN proposal described here
>> > > <https://docs.google.com/document/d/1KVgUJc1WgftHfLz118vMbEE7HV8_pUDk4s-GJFDyAOE/edit?tab=t.0#heading=h.nwyigim62nez>.
>> > >
>> > > Thanks,
>> > > Maninder
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Wed, Jun 3, 2026 at 8:59 AM Xiening Dai <[email protected]> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Today, the Iceberg spec has table properties defining the transaction
>> > > > isolation levels: write.delete/update/merge.isolation-level. These
>> > > > properties can be set to either `snapshot` or `serializable`. With a
>> > > > properly designed writer and Iceberg's multi-version snapshots, we can
>> > > > achieve single-table snapshot isolation or even serializable isolation.
>> > > >
>> > > > But for queries involving multiple tables, the spec does not provide a
>> > > > mechanism to achieve global snapshot consistency. The Iceberg REST
>> > > > Catalog (IRC) API provides only a single-table load operation, LoadTable,
>> > > > so clients need to call this API multiple times to resolve table
>> > > > metadata in a single query statement, and each call could see a different
>> > > > snapshot view of the catalog.
>> > > >
>> > > > This creates a problem, especially for engines that already support global
>> > > > SI. For example, the transaction semantics of AWS Redshift when querying its
>> > > > native tables differ from those when querying Iceberg tables, which
>> > > > surprises customers at times.
>> > > >
>> > > > There were proposals in the past in the context of the multi-statement
>> > > > transaction discussion (
>> > > > https://docs.google.com/document/d/1jr4Ah8oceOmo6fwxG_0II4vKDUHUKScb/edit#heading=h.qb9z621zr507
>> > > > ). But I feel these proposals are too complicated and require significant
>> > > > changes to the catalog/IRC protocol.
>> > > >
>> > > > Here I propose a simpler approach: add a batch LoadTables API, and rely on
>> > > > the catalog's underlying system of record to provide snapshot isolation for
>> > > > that batch read.
>> > > >
>> > > > When a client calls LoadTables({table_a, table_b, table_c}), the catalog
>> > > > reads the current metadata for all requested tables in a single consistent
>> > > > operation (e.g., a TransactGetItems in DynamoDB, or a single SI read in a
>> > > > relational DB). The client receives a consistent cross-table snapshot: the
>> > > > latest committed state of all requested tables as of a single point in time.
>> > > >
>> > > > This would give us statement-level global snapshot consistency. It
>> > > > doesn’t provide full transaction-level SI consistency for multi-statement
>> > > > transactions, but I believe it’s a reasonable trade-off.
>> > > >
>> > > > I captured the details of this proposal in this doc:
>> > > > https://docs.google.com/document/d/1u11b4pzeFUKD0XX--nHPj-DoYcNeCgOe94WKCaX2XMI/edit?usp=sharing
>> > > >
>> > > > I also created a prototype that implements the LoadTables API for Apache
>> > > > Polaris, leveraging the underlying Postgres for snapshot isolation:
>> > > > https://github.com/xndai/polaris/commit/f4eb514a2920effe67ecfb8c64e2e3fa418baf11
>> > > >
>> > > > Feedback and comments are welcome!
>> > > >
>> > >
>> >
>>
>
