Thanks for the pointer, Prashant! On Fri, Apr 18, 2025 at 12:11 PM Prashant Singh <prashant.si...@snowflake.com.invalid> wrote:
> Hey Dmitri, > > Yes we just remove the snapshot of data operations of type *REPLACE,* which > means no data was added or removed in this snapshot. (iceberg [code > < > https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/DataOperations.java#L43 > > > ]) > So we guaranteed that we never touch the snapshot which added / removed / > updated some rows. So the correctness remains intact and would never result > in data loss. > The PR is also ready for review : > https://github.com/apache/polaris/pull/1285 > It has tests as well demonstrating, with detailed comments on how it is > gonna work ! > > Best, > Prashant Singh > > > On Fri, Apr 18, 2025 at 8:56 AM Dmitri Bourlatchkov <di...@apache.org> > wrote: > > > Hi Prashant, > > > > Sorry for the delayed reply and apologies if I missed some relevant > > discussion. > > > > As I understand the catalog could remove snapshots that come in-between > > previous and current snapshots from the perspective of one of the > clients. > > > > Can we be sure that the removed snapshot does not have material data > > changes (e.g. new roes or updated rows) that should have been taken into > > account by the client whose snapshot is forced to become "current". Could > > this result in data loss? > > > > Thanks, > > Dmitri. > > > > On 2025/03/31 22:44:03 Prashant Singh wrote: > > > Hey folks, > > > > > > I wanted to propose this feature to Apache Polaris Rolling back > > > replacements operation snapshots in the case during the concurrent > write > > > (compaction and other writers trying to commit to the table at the same > > > time) to Iceberg there are conflicts. This is a feature which Ryan > > proposed > > > as an alternative when I was proposing a Priority Amongst Writer > proposal > > > [1] in the Apache Iceberg community. This kind of makes the compaction > > > always a low priority process. > > > > > > Earlier, I went ahead and added this feature as a client side change in > > the > > > Apache Iceberg repo [2] . It got some attraction but this didn't get to > > the > > > end. Now when we think more about it again Apache Polaris seems to be > the > > > best place to do it as it can benefit other language writer clients as > > well > > > and Polaris is the one to actually apply the commits based on the > > > requirements and update sent by Iceberg Rest Client. > > > > > > Here is my draft PR [3] on how I think this can be achieved, given this > > is > > > enabled by a table property, happy to discuss other knobs for ex: maybe > > > check the snapshot prop ? > > > > > > The logic essentially if we see is the base (B) on which the snapshot > we > > > want to include/commit is based on is changed to something like (B`) > and > > > the given snapshot from B` to B are all of ops type *REPLACE *. It adds > > > other updates within the same update Table req > > > 1. moved the snapshot ref to B > > > 2. [Optional] to remove the snapshot between B` to B given its all of > > > *REPLACE*. > > > Then try the requirements and updates again on the updated base and see > > if > > > it succeeds. To make all this as part of one updateReq and then commit > to > > > the table. > > > Doing it this way preserves the schema changes for which no new > snapshot > > > has been created, just a new metadata.json is created. > > > > > > Happy to know your thoughts on the same. > > > > > > Links: > > > [1] > > > > > > https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit?tab=t.0#heading=h.fn6jmpw6phpn > > > [2] https://github.com/apache/iceberg/pull/5888 > > > [3] https://github.com/apache/polaris/pull/1285 > > > > > > Best, > > > Prashant Singh > > > > > >