[GitHub] [iceberg] CodingCat commented on pull request #4795: expose the latest snapshot id committed within a thread

GitBox Sun, 22 May 2022 17:59:30 -0700


CodingCat commented on PR #4795:
URL: https://github.com/apache/iceberg/pull/4795#issuecomment-1134046692


   Hi, @kbendick 
   
   Thanks for looking at this! It seems there is some confusion about what I 
want to achieve here, so please see my inline answers below.
   
   > I’m not sure I understand what you’re trying to achieve. I understand 
you’d like to expose the snapshotId as it was when the current thread (or let’s 
just say writer) prepared its write, ie prior to the commit.
   
   I am trying to expose the snapshot id *AFTER* the commit. You can see the 
change 
[here](https://github.com/apache/iceberg/pull/4795/files#diff-3c27b7c7615fcf206c99e3e441c0621561c0a19ac3e9a26bb142329879425151R332).
   
   The problem I am trying to solve comes from the following workflow:
   
   ```
   ThreadA committed 
   
   ThreadB committed
   
   ThreadA wants to do something with the snapshot it just committed
   
   ```
   
   As I said in the description, there is currently no way for ThreadA to do 
this: whenever it calls currentSnapshotId(), it gets the latest snapshot id, 
which is the one generated by ThreadB, because the metadata has been refreshed. 
This PR simply exposes the id of ThreadA's own commit to ThreadA.
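
   To make the interleaving concrete, here is a toy sketch of the workflow 
above. It is not the code in this PR and does not use the Iceberg API: 
`ToyTable`, `commit()` and `lastCommittedSnapshotId()` are hypothetical names, 
and holding the value in a `ThreadLocal` is only an assumption about how such 
an id could be kept per thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-in for a table; none of this is the Iceberg API. */
final class ToyTable {
    private final AtomicLong current = new AtomicLong(0);
    // Hypothetical: remembers, per thread, the last snapshot id that thread committed.
    private final ThreadLocal<Long> lastCommitted = new ThreadLocal<>();

    void commit() {
        lastCommitted.set(current.incrementAndGet());
    }

    /** Shared view: always the newest snapshot, whoever committed it. */
    long currentSnapshotId() {
        return current.get();
    }

    /** Per-thread view: null if the calling thread has not committed yet. */
    Long lastCommittedSnapshotId() {
        return lastCommitted.get();
    }
}

final class SnapshotRaceSketch {
    /** Replays the workflow; returns {currentSnapshotId, lastCommittedSnapshotId} as seen by ThreadA. */
    static long[] run() {
        ToyTable table = new ToyTable();
        CountDownLatch aCommitted = new CountDownLatch(1);
        CountDownLatch bCommitted = new CountDownLatch(1);
        long[] seenByA = new long[2];

        Thread a = new Thread(() -> {
            table.commit();                 // ThreadA committed
            aCommitted.countDown();
            await(bCommitted);              // wait until ThreadB has committed
            seenByA[0] = table.currentSnapshotId();
            seenByA[1] = table.lastCommittedSnapshotId();
        }, "ThreadA");
        Thread b = new Thread(() -> {
            await(aCommitted);              // force ThreadB's commit to come second
            table.commit();                 // ThreadB committed
            bCommitted.countDown();
        }, "ThreadB");

        a.start();
        b.start();
        join(a);
        join(b);
        return seenByA;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] seen = run();
        System.out.println("currentSnapshotId() seen by ThreadA: " + seen[0]); // 2, ThreadB's commit
        System.out.println("snapshot committed by ThreadA: " + seen[1]);       // 1
    }
}
```

   In this toy model ThreadA reads 2 from the shared view even though its own 
commit produced 1; only the per-thread value still points at ThreadA's commit.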
   
   It will not break ACID: if ThreadA commits another version, that commit 
will still be based on the latest snapshot id, since I didn't change that part 
at all.
   
   
   > 
   > > I think the scenario is more pervasive than our own case, e.g. each 
notebook attached to the Databricks' notebook cluster is basically handled by a 
thread. In such an scenario, users may fall into some race condition to get the 
snapshot id committed by their own notebook with just 
currentSnapshot().snapshotId
   > 
   > What catalog are you using? You mention Databricks, and most people I’ve 
encountered using Iceberg on Databricks are using the `HadoopCatalog`. Which 
should _not_ be used in a production environment as there’s no locking 
mechanism to keep the current snapshot updateable by only one writer at a time 
(be it across threads or across Spark applications).
   > 
   > It sounds like maybe you’re trying to get around the lack of a lock, but I 
worry that you’ll have conflicting writes and clobber the previous writers work.
   
   I am using HadoopCatalog, but I am not really trying to get around the lack 
of a lock here. The lock problem was resolved by 
https://github.com/apache/iceberg/pull/3663 where you can use a DynamoDB-based 
lock manager with HadoopTableOperations.
   
   
   > 
   > What do you intend to do with this thread local snapshot Id (particularly 
once it becomes outdated via some other writer).
   
   We have multiple scenarios that need this local snapshot id. The simplest 
example is that we have multiple threads committing to the same table and we 
want to log this information for easier debugging (for example, which thread 
generated which version).
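
   As a rough illustration of that logging use case, the sketch below records 
which writer thread produced which snapshot id. It is a toy model, not the 
Iceberg API: the `AtomicLong` counter stands in for the table, and it assumes 
each writer can obtain the id of its own commit, which is exactly what this PR 
is meant to make possible. All names here are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

final class CommitLogSketch {
    /** Starts `writers` threads that each commit once; returns thread name -> snapshot id. */
    static Map<String, Long> run(int writers) {
        // Toy stand-in for the table's snapshot sequence; not the Iceberg API.
        AtomicLong snapshotIds = new AtomicLong(0);
        Map<String, Long> committedBy = new ConcurrentHashMap<>();

        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            threads[i] = new Thread(() -> {
                // Hypothetical commit: the writer learns the id of its own snapshot.
                long id = snapshotIds.incrementAndGet();
                committedBy.put(Thread.currentThread().getName(), id);
            }, "writer-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return committedBy;
    }

    public static void main(String[] args) {
        run(4).forEach((thread, id) ->
            System.out.println(thread + " committed snapshot " + id));
    }
}
```

   Which writer gets which id varies from run to run, but each id is 
attributed to exactly one thread, which is the information we want in the logs.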



