Hi Maninder,

I believe the two approaches you propose, "Global CatalogSequenceNumbers" and
"Catalog timestamps", are functionally equivalent. The difference is simply
whether we use an actual timestamp or a monotonically increasing
sequence/logical clock. I think we should use sequence numbers since they are
easier to reason about and most databases use them, e.g. PostgreSQL
transaction IDs. Further, a single Catalog is most likely backed by multiple
hosts, each with slightly different clock skew, so "Catalog timestamp" would
give the wrong impression. Catalogs would end up with complex logic where
each host has to check the timestamp of the last commit across all tables and
then advance its own clock if that host is running behind due to clock skew.
And we would still need to decide what happens when some hosts serving the
Catalog have high clock skew.
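
To illustrate the kind of per-host logic I mean, here is a rough Python
sketch (the SharedState/last_commit_ts bookkeeping and method names are
hypothetical, just for illustration):

import time

class SharedState:
    # Highest commit timestamp (ms) observed across all tables,
    # shared by every host serving the Catalog.
    def __init__(self):
        self.last_commit_ts = 0

class CatalogHost:
    def __init__(self, shared_state):
        self.shared_state = shared_state

    def next_commit_timestamp_ms(self):
        local_now = int(time.time() * 1000)
        last = self.shared_state.last_commit_ts
        # A host whose clock is behind the last commit must artificially
        # advance past it, or commit order and timestamp order diverge.
        ts = max(local_now, last + 1)
        self.shared_state.last_commit_ts = ts
        return ts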

If we implement SequenceNumbers in the Catalog, then each transaction commits
at a particular SequenceNumber. A read transaction reads as of a particular
SequenceNumber and ignores commits with a higher SequenceNumber. As you said,
multi-table commits should have the same SequenceNumber for all table
commits, so that they are atomic across tables.
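
As a rough sketch of these semantics (Python, illustrative only, not a
proposed API):

class Catalog:
    def __init__(self):
        self.seq = 0        # last committed SequenceNumber
        self.commits = {}   # table name -> list of (seq, snapshot_id)

    def commit(self, table_snapshots):
        # Multi-table commit: every table gets the same SequenceNumber,
        # which is what makes the commit atomic across tables.
        self.seq += 1
        for table, snapshot_id in table_snapshots.items():
            self.commits.setdefault(table, []).append((self.seq, snapshot_id))
        return self.seq

    def read_snapshot(self, table, read_seq):
        # A read transaction pinned at read_seq ignores commits with a
        # higher SequenceNumber.
        visible = [c for c in self.commits.get(table, []) if c[0] <= read_seq]
        return visible[-1][1] if visible else None

For example, a reader that pins the SequenceNumber returned by
catalog.commit({"customers": "snap-1", "orders": "snap-9"}) sees both
snapshots and ignores any later commits to either table.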

Further, I think using an actual timestamp could be confusing because
clients/engines also write commit timestamps and evaluate functions like
NOW() locally, so a logical clock/SequenceNumber would be a clearer way to
establish the order of events across tables. What do you think?

-Jagdeep

On Thu, May 1, 2025 at 5:49 AM Maninderjit Singh <
parmar.maninder...@gmail.com> wrote:

> Thanks for starting the discussion!
> I wanted to discuss the "Catalog GlobalSequenceNumber" a bit more. I think
> we not only need to order the snapshots across tables but also assign them
> some timestamp so that time travel queries also work consistently across
> tables. In other words, the "Catalog GlobalSequenceNumber" has to take the
> form of a timestamp (say a "catalog timestamp") which should be the same
> for all the snapshots written in the same transaction across tables and
> unique per transaction.
>
> Since in the REST protocol the catalog acts as a centralized transaction
> manager, it should be the catalog's responsibility to correctly write these
> "catalog timestamps" in the metadata.json. The "catalog timestamps" should
> also play well with the existing timestamps in the metadata.json (snapshot
> log and snapshot timestamps), since it would be confusing to have multiple
> notions of time that could imply different orderings of snapshots.
>
> I think we should also consider reusing the existing timestamps in the
> metadata.json, with the following behavior modifications and clarifications:
> 1. The catalog is responsible for writing timestamps (snapshot log and
> snapshot timestamp) in the metadata.json.
> 2. The timestamp corresponds to the commit time on the catalog's clock.
> 3. Snapshots written by the same transaction have the same timestamp across
> all tables in the catalog.
> 4. Snapshots written by different transactions have different timestamps.
> The ordering of the timestamps implies the ordering of snapshots.
> 5. Timestamp information in the manifest list file is optional and is not
> used in the metadata.json or for ordering snapshots.
>
> Thanks,
> Maninder
>
>
> On Wed, Apr 30, 2025, 3:44 PM Jagdeep Sidhu <sidhujagde...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> I enjoyed the discussion/feedback during the Iceberg community sync. I am
>> going through the feedback and updating my proposal: adding the "Catalog
>> GlobalSequenceNumber" option and describing how it would look, detailing
>> the use cases better in the document, and addressing the rest of the
>> feedback. Please also share any additional feedback via email if it was not
>> discussed in the sync.
>>
>> There was discussion in the meeting that Snapshot Isolation reads are only
>> required for tables if they were committed using the "Multi table commit"
>> API. I wanted to discuss whether that is really the case by providing an
>> example.
>>
>> Let's assume a user has a "customers" table and an "orders" table in the
>> catalog. They have two different jobs that do this work repeatedly:
>>
>> Job A) Gets the information about new customers and adds them to the
>> customers table.
>> Job B) Gets the customers in the customers table and refreshes their
>> order information from some source.
>>
>> In this case, neither job commits data via the Multi Table commit API. But
>> because of the logic of the jobs, a user running a Snapshot Isolation query
>> would expect that the "orders" table only has orders for customers that
>> exist in the "customers" table, since Job B only refreshes orders for
>> customers which exist in the customers table.
>>
>> Without a way to provide Snapshot Isolation, let's say the user runs this
>> read-only transaction:
>>
>> BEGIN;
>> SELECT <some computation> from customers where <some clause that filters
>> customers>;
>> SELECT <some computation> from orders where .... ;
>>
>> For this transaction, the following sequence is possible:
>>
>> Time 1) The transaction loads the customers table and filters customers.
>> Time 2) Job A updates the customers table and commits.
>> Time 3) Job B refreshes orders, including orders for the new customers
>> added at Time 2 by Job A, and commits.
>> Time 4) The transaction loads the orders table, which now contains the
>> orders written at Time 3 by Job B.
>>
>> Now this transaction sees "orders" for customers that don't exist in the
>> version of the "customers" table it read. Given the logic of Jobs A and B,
>> this could not happen under a Snapshot Isolation read.
>>
>> Does this example make sense as a scenario where, even though neither
>> commit used the multi-table commit API, the user may still want an SI read
>> across the two tables involved? Looking forward to hearing what folks
>> think.
>>
>> -Jagdeep
>>
>> On Wed, Apr 23, 2025 at 3:29 PM Jagdeep Sidhu <sidhujagde...@gmail.com>
>> wrote:
>>
>>> Hi Amogh/Drew/Fokko and Iceberg community,
>>>
>>> Thank you for the feedback. As you point out, I will update the document
>>> and add a section to clarify what we mean by each isolation level. I read
>>> through the linked PR as well; this proposal is along the lines of the *"1.
>>> Coordinate table loading with the catalog"* option described by Ryan in
>>> his comment.
>>>
>>> I agree with your skepticism that most users are not doing multi-table
>>> multi-statement transactions, but I also believe that we need to build
>>> better support for them before we will see more usage. Most users are
>>> operating on their tables using a single engine - most likely Spark. We
>>> have been hearing from S3 Tables customers that they would really like to
>>> work with multiple engines on their tables, for example customers running
>>> large Spark clusters who also want to operate on their tables using, say,
>>> DuckDB or Daft. My starting point for the proposal is how to make it easy
>>> for engines to build multi-table multi-statement transactions when
>>> multiple engines operate on the same tables concurrently. How do you folks
>>> think we should address this?
>>>
>>> I also want to discuss what we mean by "Snapshot Isolation". From
>>> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf
>>> :
>>>
>>> *"These discussions naturally suggest an isolation level, called
>>> Snapshot Isolation, in which each transaction reads data from a snapshot
>>> of the (committed) data as of the time the transaction started, called
>>> its Start-Timestamp."*
>>>
>>> In my proposal, the initial handshake between the client and the Catalog
>>> establishes this "Start-Timestamp". I tried to brainstorm how SI would
>>> work without an "authoritative point in time view".
>>> Let's take the example of an interactive transaction by a user:
>>> BEGIN;
>>> SELECT .... from T1; // Statement 1
>>> SELECT ...... from T1 INNER JOIN T2 ...; // Statement 2
>>>
>>> In this case, let's say statement 1 took 10 minutes to run. Then the
>>> engine tries to load T2 to run the next statement, and it has no way of
>>> finding out exactly which snapshot of T2 to load. I clarify in my proposal
>>> why timestamps would not actually help the engine in this case.
>>>
>>> Amogh, as you discuss (and the PR discusses), the engine can just load
>>> T2's current state and then verify that T1 has not changed, hence not
>>> requiring an "authoritative point in time view". I think that has a few
>>> problems:
>>> 1. The engine is not operating on a snapshot as of the transaction start
>>> time. If T2 was written to in the last minute, the transaction is now
>>> operating on state as of one minute ago, not as of its start time.
>>> 2. If T1 has changed, then the engine has to fail the transaction. This
>>> would be a bad user experience, as a read-only query under Snapshot
>>> Isolation should not have conflicts by definition - it simply reads from
>>> the database state as of the transaction's "Start-Timestamp".
>>>
>>> My worry is that this model would lead to aborting long-running read-only
>>> transactions, which Snapshot Isolation is very well suited for - read-only
>>> queries do not conflict under SI. Hence my proposal.
>>>
>>> What are your thoughts on this?
>>>
>>> I really appreciate everyone's time and input on this proposal.
>>>
>>> Thank you!
>>> -Jagdeep
>>>
>>> On Tue, Apr 22, 2025 at 10:59 PM Amogh Jahagirdar <2am...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jagdeep,
>>>>
>>>>
>>>> Thanks for the proposal! I agree with Drew’s second point, the emphasis
>>>> on clarifying the goals for snapshot and serializable isolation
>>>> guarantees before committing to catalog changes.
>>>>
>>>>
>>>> Note that a lot of the following rationale is also from this Github PR
>>>> <https://github.com/apache/iceberg/pull/6948/files#r1244026460> from a
>>>> while back, which I think summarizes the crux of the issue pretty well.
>>>>
>>>>
>>>> The current proposal advocates for strict multi-table snapshot
>>>> isolation, treating the catalog as the authority for transactional
>>>> point-in-time consistency. While this approach may end up being necessary,
>>>> I suggest we first formalize what “snapshot isolation” and “serializable”
>>>> concretely mean in Iceberg’s context. For example:
>>>>
>>>>
>>>> Snapshot Isolation: Prevent dirty/non-repeatable reads, lost updates,
>>>> and phantom reads across tables.
>>>>
>>>>
>>>> Serializable: Add prevention of write skew to the SI definition.
>>>>
>>>>
>>>> The phantom read example in the proposal’s "Limitation 1" illustrates a
>>>> clear anomaly that snapshot isolation must prevent. However, the proposal
>>>> assumes the catalog must enforce point-in-time consistency in order for
>>>> Iceberg to support multi-statement/table transactions. Point-in-time
>>>> consistency across entities in a catalog is made difficult by the fact
>>>> that there isn't a causal ordering between commits across tables, whereas
>>>> multi-table transactions can inherently have one, so catalogs themselves
>>>> may not even be reliable for such a notion. See
>>>> https://github.com/apache/iceberg/pull/6948/files#r1251319541
>>>>
>>>>
>>>> Clients may not need an authoritative point-in-time view to ensure an
>>>> isolation guarantee, so long as the overall outcome of the transaction
>>>> aligns with the isolation guarantees. So long as the client can guarantee
>>>> through existing mechanisms that the outcome of any operation is correct
>>>> per the isolation level (i.e. verifying that no data read in the
>>>> transaction has been modified), there may not be a need for catalogs to
>>>> keep track of all that state. This moves the snapshot isolation definition
>>>> closer to serializable isolation, but I think that's fine, since levels
>>>> can always be stricter, and we can then relax the cases which we know we
>>>> should support for snapshot isolation but not serializable.
>>>>
>>>>
>>>> To be convinced that catalog changes are required, I think I'd need to
>>>> see more examples of where it's either really complicated on the client
>>>> to do that validation, not scalable on the client (I'm skeptical that
>>>> users are really doing multi-table/statement transactions over many
>>>> tables, but I could be wrong), or somehow insufficient from a correctness
>>>> perspective, considering the previously mentioned alternative of the
>>>> client side doing a more "serializable isolation"-like validation. Let me
>>>> know if I'm missing something!
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Amogh Jahagirdar
>>>>
>>>
