Thanks Ryan and Szehon for responding and clarifying some concepts for me. I got my answer, but I'd like to clarify a few more points on transactions (not urgent, only if you have time):
1) The isolation level is only checked at commit time, correct?
2) Does copy-on-write or merge-on-read have any bearing on how these transaction semantics work?
3) For the multiple-writers scenario:
   1. For snapshot isolation: Ryan, how does your example scenario play out differently if the isolation level is snapshot?
   2. If, instead of adding two new files to a snapshot, both writers update the same file or the same record with non-conflicting changes, how do the two isolation levels handle it differently at commit time?

On Thu, May 4, 2023 at 3:22 PM Szehon Ho <szehon...@apple.com.invalid> wrote:

> Whoops, I didn't see Ryan answer already.
>
> On May 4, 2023, at 3:18 PM, Szehon Ho <szehon...@apple.com.INVALID> wrote:
>
> Hi,
>
> I believe it only matters if you have conflicting commits. For the single
> writer case, I think you are right and it should not matter, so you may
> save very slightly in performance by turning it to Snapshot Isolation. The
> checks are metadata checks though, so I would think it will not be a
> significant performance difference.
>
> In general, the isolation levels in Iceberg work by checking before commit
> to see if there are any conflicting changes to the data files about to be
> committed, from when the operation first started (i.e., the starting
> snapshot id). So if there is a failure due to the isolation level, I
> believe the error bubbles back to the application to try again, hence
> 'pessimistic'.
>
> Note, metadata conflicts are automatically retried and should rarely
> bubble up to the user, so error handling will be required only in the case
> of a data isolation level conflict (i.e., you delete a file that is
> currently being rewritten by another operation).
>
> Hope that helps
> Szehon
>
> On May 4, 2023, at 12:19 PM, Nirav Patel <nira...@gmail.com> wrote:
>
> I am trying to ingest data into an Iceberg table using Spark streaming.
> There are no multiple writers to the same data at the moment.
> According to the Iceberg API
> <https://iceberg.apache.org/javadoc/0.11.0/org/apache/iceberg/IsolationLevel.html#:%7E:text=Both%20of%20them%20provide%20a,environments%20with%20many%20concurrent%20writers.>
> the default isolation level for a table is serializable. I want to
> understand: if only a single application (a single Spark streaming job in
> my case) is writing to the Iceberg table, is there any advantage or
> disadvantage to using serializable versus snapshot isolation? Is there any
> performance impact of using serializable when only one application is
> writing to the table? Also, it seems Iceberg allows all writers to write
> into a snapshot and uses OCC to decide if one needs to retry because it was
> late. In that case, how is it serializable at all? Isn't serializability
> achieved via pessimistic concurrency control? I would like to understand
> how Iceberg implements the serializable isolation level and how it differs
> from snapshot isolation.
>
> Thanks
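
For anyone following the thread, the commit-time check Szehon describes can be sketched as a toy model. This is only an illustration under stated assumptions, not Iceberg's code: `Table`, `CommitConflict`, and `commit_overwrite` are hypothetical names, files are plain strings, and the "filter" is a simple predicate on the file path. The assumption being illustrated is that both levels reject a commit if a file being replaced was removed concurrently, while serializable additionally rejects it if a concurrent commit added a file that might match the operation's condition.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when validation against concurrent commits fails (hypothetical)."""


@dataclass
class Table:
    # Each snapshot is a pair (added_files, removed_files).
    # The snapshot id is the number of snapshots committed so far.
    snapshots: list = field(default_factory=list)

    def current_id(self):
        return len(self.snapshots)


def commit_overwrite(table, start_id, removed, added, matches_filter, isolation):
    """Try to commit a row-level rewrite that started at snapshot `start_id`.

    All checks happen here, at commit time, by looking only at the
    snapshots committed after the operation started.
    """
    for added_since, removed_since in table.snapshots[start_id:]:
        # Both levels: a file we are replacing must still exist.
        if removed & removed_since:
            raise CommitConflict("a replaced file was removed concurrently")
        # Serializable only: no concurrently added file may match our condition.
        if isolation == "serializable" and any(matches_filter(f) for f in added_since):
            raise CommitConflict("a concurrent commit added matching data")
    table.snapshots.append((set(added), set(removed)))


def scenario(isolation):
    """One writer rewrites day=1/a while another appends day=1/b."""
    table = Table()
    table.snapshots.append(({"day=1/a"}, set()))   # initial data
    start = table.current_id()                     # rewrite starts here
    table.snapshots.append(({"day=1/b"}, set()))   # concurrent append lands first
    commit_overwrite(table, start, {"day=1/a"}, {"day=1/a2"},
                     lambda f: f.startswith("day=1/"), isolation)
    return table
```

In this toy model, `scenario("snapshot")` commits, while `scenario("serializable")` raises `CommitConflict`, because the concurrently appended file falls inside the rewrite's filter. With a single writer there are no snapshots after `start_id`, so the loop body never runs under either level.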