Hey Dmitri,

Yes, we only remove snapshots whose data operation is of type *REPLACE*,
which means no data was added or removed in that snapshot (Iceberg code:
<https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/DataOperations.java#L43>).
So we guarantee that we never touch a snapshot that added, removed, or
updated rows. Correctness remains intact and this would never result
in data loss.
The PR is also ready for review:
https://github.com/apache/polaris/pull/1285
It has tests as well, with detailed comments demonstrating how it is
going to work!
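
To illustrate the check described above, here is a minimal sketch in plain
Java. It is not the code from the PR: the Snap class, the
removableSnapshots method, and the map-based lookup are hypothetical
stand-ins for Iceberg's snapshot metadata. Only the "replace" string is
meant to mirror DataOperations.REPLACE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ReplaceRollbackSketch {
    // Assumed to match the string value of Iceberg's DataOperations.REPLACE.
    static final String REPLACE = "replace";

    // Hypothetical, simplified stand-in for a table snapshot.
    static final class Snap {
        final long id;
        final Long parentId; // null for the first snapshot
        final String operation;

        Snap(long id, Long parentId, String operation) {
            this.id = id;
            this.parentId = parentId;
            this.operation = operation;
        }
    }

    /**
     * Walks from the current head back to the writer's base snapshot.
     * Returns the ids that could be rolled back (newest first) only if
     * every snapshot after the base is a REPLACE; otherwise returns
     * Optional.empty(), meaning the rollback must not be attempted.
     */
    static Optional<List<Long>> removableSnapshots(
            Map<Long, Snap> byId, long baseId, long headId) {
        List<Long> removable = new ArrayList<>();
        Long cursor = headId;
        while (cursor != null && cursor != baseId) {
            Snap s = byId.get(cursor);
            if (s == null || !REPLACE.equals(s.operation)) {
                // A snapshot that added/removed/updated rows (or an
                // unknown one) is in the way: leave the table untouched.
                return Optional.empty();
            }
            removable.add(s.id);
            cursor = s.parentId;
        }
        // cursor == null means the base is not an ancestor of the head.
        return cursor == null ? Optional.empty() : Optional.of(removable);
    }
}
```

In this sketch, a single append, overwrite, or delete between the base and
the head makes the method return empty, so such a snapshot is never a
candidate for removal.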

Best,
Prashant Singh


On Fri, Apr 18, 2025 at 8:56 AM Dmitri Bourlatchkov <di...@apache.org>
wrote:

> Hi Prashant,
>
> Sorry for the delayed reply and apologies if I missed some relevant
> discussion.
>
> As I understand the catalog could remove snapshots that come in-between
> previous and current snapshots from the perspective of one of the clients.
>
> Can we be sure that the removed snapshot does not have material data
> changes (e.g. new rows or updated rows) that should have been taken into
> account by the client whose snapshot is forced to become "current"? Could
> this result in data loss?
>
> Thanks,
> Dmitri.
>
> On 2025/03/31 22:44:03 Prashant Singh wrote:
> > Hey folks,
> >
> > I wanted to propose a feature for Apache Polaris: rolling back
> > replace-operation snapshots when concurrent writes to Iceberg
> > (compaction and other writers trying to commit to the table at the same
> > time) conflict. Ryan proposed this as an alternative when I was
> > proposing a Priority Amongst Writer proposal [1] in the Apache Iceberg
> > community. It effectively makes compaction always a low-priority
> > process.
> >
> > Earlier, I added this feature as a client-side change in the Apache
> > Iceberg repo [2]. It got some traction but didn't get to the end.
> > Thinking about it again, Apache Polaris seems to be the best place to
> > do it, as it can benefit writer clients in other languages as well, and
> > Polaris is the one that actually applies the commits based on the
> > requirements and updates sent by the Iceberg REST client.
> >
> > Here is my draft PR [3] on how I think this can be achieved. It is
> > enabled by a table property; happy to discuss other knobs, e.g. maybe
> > checking a snapshot property?
> >
> > The logic is essentially this: if the base (B) on which the snapshot we
> > want to commit is based has changed to something like (B`), and the
> > snapshots from B` back to B are all of operation type *REPLACE*, it
> > adds further updates within the same update-table request:
> > 1. Move the snapshot ref to B.
> > 2. [Optional] Remove the snapshots between B` and B, given they are all
> > *REPLACE*.
> > It then retries the requirements and updates on the updated base to see
> > if they succeed. All of this is made part of one update request, which
> > is then committed to the table.
> > Doing it this way preserves schema changes for which no new snapshot
> > was created, only a new metadata.json.
> >
> > Happy to know your thoughts on the same.
> >
> > Links:
> > [1] https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit?tab=t.0#heading=h.fn6jmpw6phpn
> > [2] https://github.com/apache/iceberg/pull/5888
> > [3] https://github.com/apache/polaris/pull/1285
> >
> > Best,
> > Prashant Singh
> >
>
