Hey Dmitri, Yes we just remove the snapshot of data operations of type *REPLACE,* which means no data was added or removed in this snapshot. (iceberg [code <https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/DataOperations.java#L43> ]) So we guaranteed that we never touch the snapshot which added / removed / updated some rows. So the correctness remains intact and would never result in data loss. The PR is also ready for review : https://github.com/apache/polaris/pull/1285 It has tests as well demonstrating, with detailed comments on how it is gonna work !
Best, Prashant Singh On Fri, Apr 18, 2025 at 8:56 AM Dmitri Bourlatchkov <di...@apache.org> wrote: > Hi Prashant, > > Sorry for the delayed reply and apologies if I missed some relevant > discussion. > > As I understand the catalog could remove snapshots that come in-between > previous and current snapshots from the perspective of one of the clients. > > Can we be sure that the removed snapshot does not have material data > changes (e.g. new roes or updated rows) that should have been taken into > account by the client whose snapshot is forced to become "current". Could > this result in data loss? > > Thanks, > Dmitri. > > On 2025/03/31 22:44:03 Prashant Singh wrote: > > Hey folks, > > > > I wanted to propose this feature to Apache Polaris Rolling back > > replacements operation snapshots in the case during the concurrent write > > (compaction and other writers trying to commit to the table at the same > > time) to Iceberg there are conflicts. This is a feature which Ryan > proposed > > as an alternative when I was proposing a Priority Amongst Writer proposal > > [1] in the Apache Iceberg community. This kind of makes the compaction > > always a low priority process. > > > > Earlier, I went ahead and added this feature as a client side change in > the > > Apache Iceberg repo [2] . It got some attraction but this didn't get to > the > > end. Now when we think more about it again Apache Polaris seems to be the > > best place to do it as it can benefit other language writer clients as > well > > and Polaris is the one to actually apply the commits based on the > > requirements and update sent by Iceberg Rest Client. > > > > Here is my draft PR [3] on how I think this can be achieved, given this > is > > enabled by a table property, happy to discuss other knobs for ex: maybe > > check the snapshot prop ? > > > > The logic essentially if we see is the base (B) on which the snapshot we > > want to include/commit is based on is changed to something like (B`) and > > the given snapshot from B` to B are all of ops type *REPLACE *. It adds > > other updates within the same update Table req > > 1. moved the snapshot ref to B > > 2. [Optional] to remove the snapshot between B` to B given its all of > > *REPLACE*. > > Then try the requirements and updates again on the updated base and see > if > > it succeeds. To make all this as part of one updateReq and then commit to > > the table. > > Doing it this way preserves the schema changes for which no new snapshot > > has been created, just a new metadata.json is created. > > > > Happy to know your thoughts on the same. > > > > Links: > > [1] > > > https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit?tab=t.0#heading=h.fn6jmpw6phpn > > [2] https://github.com/apache/iceberg/pull/5888 > > [3] https://github.com/apache/polaris/pull/1285 > > > > Best, > > Prashant Singh > > >