CodingCat commented on PR #4795: URL: https://github.com/apache/iceberg/pull/4795#issuecomment-1134046692
Hi @kbendick, thanks for looking at this! There seems to be some confusion about what I want to achieve here; please see my inline answers.

> I’m not sure I understand what you’re trying to achieve. I understand you’d like to expose the snapshotId as it was when the current thread (or let’s just say writer) prepared its write, ie prior to the commit.

I am trying to expose the snapshot id *AFTER* the commit; you can check the change [here](https://github.com/apache/iceberg/pull/4795/files#diff-3c27b7c7615fcf206c99e3e441c0621561c0a19ac3e9a26bb142329879425151R332).

The problem I am trying to solve comes from the following workflow:

```
ThreadA committed
ThreadB committed
ThreadA wants to do something with the snapshot it just committed
```

As I said in the description, there is currently no way for ThreadA to do this: whenever it calls currentSnapshotId(), it gets the latest snapshot id generated by ThreadB, because the metadata has been refreshed. This PR only exposes that information to ThreadA.

It will not break ACID: if ThreadA commits another version, that commit will still be based on the latest snapshot id, since I didn't change that part at all.

> > I think the scenario is more pervasive than our own case, e.g. each notebook attached to the Databricks' notebook cluster is basically handled by a thread. In such an scenario, users may fall into some race condition to get the snapshot id committed by their own notebook with just currentSnapshot().snapshotId
>
> What catalog are you using? You mention Databricks, and most people I’ve encountered using Iceberg on Databricks are using the `HadoopCatalog`. Which should _not_ be used in a production environment as there’s no locking mechanism to keep the current snapshot updateable by only one writer at a time (be it across threads or across Spark applications).
>
> It sounds like maybe you’re trying to get around the lack of a lock, but I worry that you’ll have conflicting writes and clobber the previous writers work.

I am using HadoopCatalog, but I am not trying to get around the lack of a lock here. The lock problem was resolved by https://github.com/apache/iceberg/pull/3663, which lets you use a DynamoDB-based lockManager with HadoopTableOperations.

> What do you intend to do with this thread local snapshot Id (particularly once it becomes outdated via some other writer).

We have multiple scenarios that need this local snapshot id. The simplest example: we have multiple threads committing to the same table, and we want to log this information for easier debugging (for example, which thread generated which version).
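
To make the ThreadA/ThreadB workflow above concrete, here is a minimal, self-contained sketch of the race. It does not use the Iceberg API: `FakeTable`, `commit()`, and `lastCommittedSnapshotId()` are hypothetical stand-ins, and the thread-local is only an assumption about how a per-thread "snapshot I just committed" could be tracked, not necessarily how the PR does it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class SnapshotRaceSketch {

    /** Hypothetical stand-in for a table shared by several writer threads. */
    static final class FakeTable {
        // Shared view: the latest snapshot id, visible to every thread.
        private final AtomicLong current = new AtomicLong(0);
        // Per-thread view: the id produced by this thread's own last commit.
        private final ThreadLocal<Long> lastCommitted = new ThreadLocal<>();

        long commit() {
            long id = current.incrementAndGet();
            lastCommitted.set(id);
            return id;
        }

        long currentSnapshotId() {
            return current.get();
        }

        Long lastCommittedSnapshotId() {
            return lastCommitted.get();
        }
    }

    /** Returns {id ThreadA reads from the shared view, id ThreadA committed}. */
    static long[] run() {
        FakeTable table = new FakeTable();
        CountDownLatch aCommitted = new CountDownLatch(1);
        CountDownLatch bCommitted = new CountDownLatch(1);
        long[] seenByA = new long[2];

        Thread a = new Thread(() -> {
            table.commit();                 // ThreadA committed
            aCommitted.countDown();
            await(bCommitted);              // wait until ThreadB has committed
            seenByA[0] = table.currentSnapshotId();
            seenByA[1] = table.lastCommittedSnapshotId();
        });
        Thread b = new Thread(() -> {
            await(aCommitted);
            table.commit();                 // ThreadB committed
            bCommitted.countDown();
        });

        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenByA;
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] seen = run();
        System.out.println("currentSnapshotId seen by ThreadA: " + seen[0]);
        System.out.println("snapshot committed by ThreadA: " + seen[1]);
    }
}
```

In this sketch the shared view gives ThreadA the id of ThreadB's commit, while the per-thread value still points at ThreadA's own commit, which is the information needed for logging which thread generated which version.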
