Hi Amogh/Drew/Fokko and Iceberg community,

Thank you for the feedback. As you point out, I will update the document
and add a section to clarify what we mean by each isolation level. I read
through the linked PR as well; this proposal is along the lines of the *"1.
Coordinate table loading with the catalog"* approach described by Ryan in
his comment.

I agree with your skepticism that most users are not doing multi-table,
multi-statement transactions today, but I also believe that we need to
build better support for them before we see more usage. Most users operate
on their tables with a single engine - most likely Spark. We have been
hearing from S3 Tables customers that they would really like to work with
multiple engines on their tables. For example, customers running large
Spark clusters also want to operate on those tables using, say, DuckDB
or Daft. My starting point for the proposal is how to make it easy for
engines to build multi-table, multi-statement transactions when multiple
engines operate on the same tables concurrently. How do you folks think we
should address this?

I also want to discuss what we mean by "Snapshot Isolation". From
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
:

*"These discussions naturally suggest an isolation level, called Snapshot
Isolation, in which each transaction reads data from a snapshot of
the (committed) data as of the time the transaction started, called its
Start-Timestamp."*

In my proposal, the initial handshake between the client and the Catalog
establishes this "Start-Timestamp". I tried to brainstorm how SI would work
without an "authoritative point-in-time view".
Let's take an example of an interactive transaction run by a user:
BEGIN;
SELECT ... FROM T1;                      -- Statement 1
SELECT ... FROM T1 INNER JOIN T2 ...;    -- Statement 2

In this case, let's say statement 1 took 10 minutes to run. When the engine
then tries to load T2 to run the next statement, it has no way of finding
out exactly which snapshot of T2 to load. I clarify in my proposal why
timestamps would not actually help the engine in this case.
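To make the handshake concrete, here is a minimal toy sketch of the model I have in mind: the catalog issues a logical "Start-Timestamp" at BEGIN, and every later table load is resolved against it. All names here (ToyCatalog, begin, load_table) are illustrative only, not Iceberg or proposed APIs:

```python
# Hypothetical sketch: a catalog-authoritative logical clock pins the
# transaction's point in time, so statement 2 can load T2 as of BEGIN
# even if T2 was committed to while statement 1 ran.

class ToyCatalog:
    def __init__(self):
        self._clock = 0          # catalog-authoritative logical clock
        self._log = {}           # table -> [(commit_ts, snapshot_id)]

    def commit(self, table, snapshot_id):
        self._clock += 1
        self._log.setdefault(table, []).append((self._clock, snapshot_id))

    def begin(self):
        # Initial handshake: the catalog hands the client its Start-Timestamp.
        return self._clock

    def load_table(self, table, start_ts):
        # Resolve to the latest snapshot committed at or before the
        # transaction's Start-Timestamp, ignoring later commits.
        visible = [(ts, s) for ts, s in self._log.get(table, []) if ts <= start_ts]
        return visible[-1][1] if visible else None


catalog = ToyCatalog()
catalog.commit("T1", "t1-s1")
catalog.commit("T2", "t2-s1")

start_ts = catalog.begin()        # BEGIN: pin the point in time
catalog.commit("T2", "t2-s2")     # a concurrent writer advances T2

# Statement 2 (the join) still sees T2 as of the Start-Timestamp:
assert catalog.load_table("T1", start_ts) == "t1-s1"
assert catalog.load_table("T2", start_ts) == "t2-s1"   # not t2-s2
```

The point of the sketch is only that, without the catalog-issued Start-Timestamp, the engine has nothing to pass as `start_ts` when it loads T2.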

Amogh, as you discuss (and the PR discusses), the engine can just load T2's
current state and then verify that T1 has not changed - hence not
requiring an "authoritative point-in-time view". I think that has a few
problems:
1. The engine is not operating on a snapshot as of the transaction's start
time. If T2 was written to a minute ago, the transaction is now operating
on state as of one minute ago, not as of its start time.
2. If T1 has changed, the engine has to fail the transaction. This would
be a bad user experience: a read-only query under Snapshot Isolation
should not have conflicts by definition - it simply reads the database
state as of the transaction's "Start-Timestamp".
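To illustrate problem 2, here is a small sketch of that validation-based alternative: load T2 at its current state, then re-check every previously read table. All names are hypothetical, not Iceberg APIs; the point is just that a pure read can be forced to abort:

```python
# Hypothetical sketch of the validation-based alternative: no point-in-time
# pin; instead, statement 2 validates that earlier reads are unchanged and
# aborts otherwise - even though the transaction is read-only.

class ValidationAbort(Exception):
    pass

def run_join_statement(catalog_state, txn_reads):
    """catalog_state: table -> current snapshot id.
    txn_reads: table -> snapshot id the transaction already read."""
    # Load T2 at whatever is current right now (no Start-Timestamp)...
    t2_snapshot = catalog_state["T2"]
    # ...then validate that every previously read table is unchanged.
    for table, read_snapshot in txn_reads.items():
        if catalog_state[table] != read_snapshot:
            raise ValidationAbort(f"{table} changed; aborting read-only txn")
    return t2_snapshot

# Statement 1 read T1 at snapshot t1-s1; a concurrent writer then
# committed t1-s2 before statement 2 ran.
state = {"T1": "t1-s2", "T2": "t2-s2"}
reads = {"T1": "t1-s1"}
try:
    run_join_statement(state, reads)
    aborted = False
except ValidationAbort:
    aborted = True
assert aborted  # a pure read aborts, which SI by definition avoids
```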

My worry is that this model would lead to aborting long-running read-only
transactions, which is exactly the workload Snapshot Isolation is good
for - read-only queries do not conflict under SI. Hence my proposal.

What are your thoughts on this?

I really appreciate everyone's time and input on this proposal.

Thank you!
-Jagdeep

On Tue, Apr 22, 2025 at 10:59 PM Amogh Jahagirdar <2am...@gmail.com> wrote:

> Hi Jagdeep,
>
>
> Thanks for the proposal! I agree with Drew’s second point, emphasizing the
> need to clarify the goals for snapshot and serializable isolation guarantees
> before committing to catalog changes.
>
>
> Note that a lot of the following rationale is also from this GitHub PR
> <https://github.com/apache/iceberg/pull/6948/files#r1244026460> from a while
> back, which I think summarizes the crux of the issue pretty well.
>
>
> The current proposal advocates for strict multi-table snapshot isolation,
> treating the catalog as the authority for transactional point-in-time
> consistency. While this approach may end up being necessary, I suggest we
> first formalize what “snapshot isolation” and “serializable” concretely
> mean in Iceberg’s context. For example,
>
>
> Snapshot Isolation: Prevent dirty/non-repeatable reads, lost updates, and
> phantom reads across tables.
>
>
> Serializable: Add prevention of write skew to the SI definition.
>
>
> The phantom read example in the proposal’s "Limitation 1" illustrates a
> clear anomaly that snapshot isolation must prevent. However, the proposal
> assumes the catalog must enforce point-in-time consistency in order for
> Iceberg to support multi-statement/table transactions. Point in time across
> entities in catalog is made difficult by the fact that there isn't a causal
> ordering between commits across tables, but multi-table transactions can
> inherently have this, so catalogs themselves may not even be reliable for
> such a notion. See
> https://github.com/apache/iceberg/pull/6948/files#r1251319541
>
>
> Clients may not need an authoritative point in time view to ensure an
> isolation guarantee so long as the overall outcome of the transaction
> aligns with isolation guarantees. So long as the client can guarantee
> through existing mechanisms that the outcome of any operation is correct as
> per the isolation level (i.e. verifying no read data in the transaction has
> been modified) there may not be a need for catalogs to keep track of all
> that state. This moves the snapshot isolation definition closer to
> serializable isolation, but I think that's fine since levels can always be
> stricter and then we relax the use cases which we know we should support
> for snapshot isolation but not serializable.
>
>
> To be convinced that catalog changes are required, I think I'd need to see
> more examples for where it's either really complicated on the client to do
> that validation, not scalable on the client (I'm skeptical users are really
> doing multi table/statement transactions over many tables, but I can be
> wrong), or it's somehow insufficient from a correctness perspective
> considering the previously mentioned alternative of client side doing a
> more "serializable isolation" like validation. Let me know if I'm missing
> something!
>
>
>
> Thanks,
>
>
> Amogh Jahagirdar
>
