Hi everyone,

I enjoyed the discussion and feedback during the Iceberg community sync. I am going through the feedback and updating my proposal - adding the "Catalog GlobalSequenceNumber" option and describing how it would look, detailing the use cases better in the document, and addressing the rest of the feedback. Please also share via email if anyone has additional feedback that was not discussed in the sync.
There was discussion in the meeting that read Snapshot Isolation is only required for tables if they were committed using the "Multi table commit" API. I wanted to examine whether that is really the case by providing an example.

Let's assume a user has a "customers" table and an "orders" table in the catalog, and two different jobs that do this work repeatedly:

Job A) Gets information about new customers and adds them to the customers table.
Job B) Gets the customers in the customers table and refreshes their order information from some source.

Neither job commits data via the multi-table commit API. But because of the logic of the jobs, a user running a Snapshot Isolation query would expect the "orders" table to only contain orders for customers that exist in the "customers" table, since Job B only refreshes orders for customers that exist in the customers table.

Without a way to provide Snapshot Isolation, let's say the user runs this read-only transaction:

BEGIN;
SELECT <some computation> from customers where <some clause that filters customers>;
SELECT <some computation> from orders where .... ;

For this transaction, the following sequence is possible:

Time 1) The transaction loads the customers table and filters customers.
Time 2) Job A updates the customers table and commits.
Time 3) Job B refreshes orders, including the new customers added at Time 2 by Job A, and commits.
Time 4) The transaction loads the orders table, which now has orders for the customers added at Time 3 by Job B.

Now this transaction is seeing "orders" for customers that do not exist in its view of the "customers" table. Given the logic of Jobs A and B, this could not happen under a Snapshot Isolation read.
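To make the interleaving concrete, here is a small runnable sketch of the timeline above (plain Python, with in-memory sets standing in for the committed versions of the two tables - purely illustrative, no Iceberg APIs involved):

# Each "table" is a list of committed versions (snapshots).
customers_versions = [{"alice"}]
orders_versions = [{("alice", "order-1")}]

# Time 1: the read-only transaction loads the current customers snapshot.
txn_customers = customers_versions[-1]

# Time 2: Job A commits a new customer.
customers_versions.append(customers_versions[-1] | {"bob"})

# Time 3: Job B reads the latest customers and refreshes orders for them.
latest_customers = customers_versions[-1]
orders_versions.append({(c, "orders-for-" + c) for c in latest_customers})

# Time 4: the same transaction now loads the current orders snapshot.
txn_orders = orders_versions[-1]

# The transaction sees orders for a customer it never saw in "customers".
orphans = {c for (c, _) in txn_orders} - txn_customers
print(orphans)  # -> {'bob'}

If the engine could instead load the orders snapshot that was current at Time 1, the set difference above would be empty - that is the point-in-time view the proposal asks the catalog to provide.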
Does this example make sense as a scenario where, even though neither commit used the multi-table commit API, the user may still want an SI read across the two tables involved? Looking forward to hearing what folks think!

-Jagdeep

On Wed, Apr 23, 2025 at 3:29 PM Jagdeep Sidhu <sidhujagde...@gmail.com> wrote:

> Hi Amogh/Drew/Fokko and Iceberg community,
>
> Thank you for the feedback. As you point out, I will update the document
> and add a section to clarify what we mean by each isolation level. I read
> through the linked PR as well; this proposal is along the lines of the
> *"1. Coordinate table loading with the catalog"* option described by Ryan
> in his comment.
>
> I agree with your skepticism that most users are not doing multi-table
> multi-statement transactions, but I also believe that we need to build
> better support for them before we see more usage. Most users operate on
> their tables using a single engine - most likely Spark. We have been
> hearing from S3 Tables customers that they would really like to work with
> multiple engines on their tables - for example, customers running large
> Spark clusters who also want to operate on their tables using, say,
> DuckDB or Daft. My starting point for the proposal is how to make it easy
> for engines to build multi-table multi-statement transactions when
> multiple engines operate on the same tables concurrently. How do you
> folks think we should address this?
>
> I also want to discuss what we think of "Snapshot Isolation". From
> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf:
>
> *"These discussions naturally suggest an isolation level, called Snapshot
> Isolation, in which each transaction reads data from a snapshot of the
> (committed) data as of the time the transaction started, called its
> Start-Timestamp."*
>
> In my proposal, the initial handshake between the client and the catalog
> establishes this "Start-Timestamp". I tried to brainstorm how SI would
> work without an "authoritative point in time view". Let's take an example
> of an interactive transaction by a user:
>
> BEGIN;
> SELECT .... from T1; // Statement 1
> SELECT ...... from T1 INNER JOIN T2 ...; // Statement 2
>
> In this case, let's say statement 1 took 10 minutes to run. The engine
> then tries to load T2 to run the next statement, and it has no way of
> finding out exactly which snapshot of T2 to load. I clarify in my
> proposal why timestamps would actually not help the engine in this case.
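[To make the handshake above concrete, here is a toy runnable sketch of the client flow I have in mind. The names - begin_transaction, load_table_as_of - are hypothetical stand-ins, and the logical clock is only in the spirit of the "Catalog GlobalSequenceNumber" option; this is not the proposed REST API itself.]

class StubCatalog:
    # Toy catalog that can serve point-in-time table loads.
    def __init__(self):
        self.clock = 0  # logical clock, akin to a GlobalSequenceNumber
        self.history = {"T1": [(0, "T1-snap-0")], "T2": [(0, "T2-snap-0")]}

    def tick(self):
        self.clock += 1
        return self.clock

    def begin_transaction(self):
        return self.tick()  # catalog-chosen Start-Timestamp

    def commit(self, table, snapshot):
        self.history[table].append((self.tick(), snapshot))

    def load_table_as_of(self, table, ts):
        # Latest snapshot committed at or before the Start-Timestamp.
        return max(s for s in self.history[table] if s[0] <= ts)[1]

catalog = StubCatalog()
start_ts = catalog.begin_transaction()         # BEGIN: handshake fixes the view

t1 = catalog.load_table_as_of("T1", start_ts)  # Statement 1 (runs 10 minutes)
catalog.commit("T2", "T2-snap-1")              # concurrent commit meanwhile

# Ten minutes later, the engine still knows exactly which T2 to read:
t2 = catalog.load_table_as_of("T2", start_ts)  # Statement 2
print(t1, t2)  # -> T1-snap-0 T2-snap-0, both as of the Start-Timestamp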
> Amogh, as you discuss (and the PR discusses), the engine can just load
> T2's current state and then verify that T1 has not changed - hence not
> requiring an "authoritative point in time view". I think that has a few
> problems:
> 1. The engine is not operating on a snapshot as of the transaction's
> start time. If T2 was written in the last minute, the transaction is now
> operating on state as of one minute ago, not as of its start time.
> 2. If T1 has changed, then the engine has to fail the transaction. This
> would be a bad user experience, as a read-only query using Snapshot
> Isolation should not have conflicts by definition - it simply reads the
> database state as of the transaction's "Start-Timestamp".
>
> My worry is that this model would lead to aborting long-running read-only
> transactions, which Snapshot Isolation is very good for - read-only
> queries do not conflict under SI. Hence my proposal.
>
> What are your thoughts on this?
>
> I really appreciate everyone's time and input on this proposal.
>
> Thank you!
> -Jagdeep
>
> On Tue, Apr 22, 2025 at 10:59 PM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
>
>> Hi Jagdeep,
>>
>> Thanks for the proposal! I agree with the emphasis in Drew's second
>> point on clarifying the goals for snapshot and serializable isolation
>> guarantees before committing to catalog changes.
>>
>> Note that a lot of the following rationale is also from this GitHub PR
>> <https://github.com/apache/iceberg/pull/6948/files#r1244026460> from a
>> while back, which I think summarizes the crux of the issue pretty well.
>>
>> The current proposal advocates for strict multi-table snapshot
>> isolation, treating the catalog as the authority for transactional
>> point-in-time consistency. While this approach may end up being
>> necessary, I suggest we first formalize what "snapshot isolation" and
>> "serializable" concretely mean in Iceberg's context. For example:
>>
>> Snapshot Isolation: Prevent dirty reads, non-repeatable reads, lost
>> updates, and phantom reads across tables.
>>
>> Serializable: Add prevention of write skew to the SI definition.
>>
>> The phantom read example in the proposal's "Limitation 1" illustrates a
>> clear anomaly that snapshot isolation must prevent. However, the
>> proposal assumes the catalog must enforce point-in-time consistency in
>> order for Iceberg to support multi-statement/table transactions.
>> Point-in-time consistency across entities in a catalog is made difficult
>> by the fact that there isn't a causal ordering between commits across
>> tables (multi-table transactions can inherently have one), so catalogs
>> themselves may not even be reliable for such a notion. See
>> https://github.com/apache/iceberg/pull/6948/files#r1251319541
>>
>> Clients may not need an authoritative point-in-time view to provide an
>> isolation guarantee, so long as the overall outcome of the transaction
>> aligns with that guarantee. If the client can verify through existing
>> mechanisms that the outcome of any operation is correct per the
>> isolation level (i.e., that no data read in the transaction has since
>> been modified), there may not be a need for catalogs to keep track of
>> all that state. This moves the snapshot isolation definition closer to
>> serializable isolation, but I think that's fine, since levels can always
>> be stricter; we can then relax the use cases that we know we should
>> support for snapshot isolation but not for serializable.
>>
>> To be convinced that catalog changes are required, I think I'd need to
>> see more examples where it's either really complicated for the client to
>> do that validation, not scalable on the client (I'm skeptical that users
>> are really doing multi-table/statement transactions over many tables,
>> but I could be wrong), or somehow insufficient from a correctness
>> perspective, given the previously mentioned alternative of the client
>> doing a more "serializable isolation"-like validation. Let me know if
>> I'm missing something!
>>
>> Thanks,
>>
>> Amogh Jahagirdar
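P.S. For anyone following the thread: here is my reading of the client-side validation Amogh describes above, as a toy runnable sketch (illustrative snapshot-id bookkeeping in plain Python - not an existing Iceberg API). Note how even a read-only transaction can be forced to abort at the end, which is the user-experience concern I raised:

# "Load current state, then validate reads at the end" in miniature.
# snapshot_ids maps table -> currently committed snapshot id.
snapshot_ids = {"T1": 7, "T2": 3}
read_set = {}

def read_table(name):
    read_set[name] = snapshot_ids[name]  # remember the snapshot we read
    return name + "@snap" + str(snapshot_ids[name])

t1 = read_table("T1")        # Statement 1
snapshot_ids["T1"] = 8       # concurrent commit to T1 while we run
t2 = read_table("T2")        # Statement 2 reads T2's *current* state

# End-of-transaction validation: every table we read must be unchanged.
conflicts = {n for n, s in read_set.items() if snapshot_ids[n] != s}
if conflicts:
    print("abort read-only transaction; changed since read:", conflicts)
# -> abort read-only transaction; changed since read: {'T1'}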