Thanks for the pointer, Prashant!

On Fri, Apr 18, 2025 at 12:11 PM Prashant Singh
<prashant.si...@snowflake.com.invalid> wrote:

> Hey Dmitri,
>
> Yes we just remove the snapshot of data operations of type *REPLACE,* which
> means no data was added or removed in this snapshot. (iceberg [code
> <
> https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/DataOperations.java#L43
> >
> ])
> So we guaranteed that we never touch the snapshot which added / removed /
> updated some rows. So the correctness remains intact and would never result
> in data loss.
> The PR is also ready for review :
> https://github.com/apache/polaris/pull/1285
> It has tests as well demonstrating, with detailed comments on how it is
> gonna work !
>
> Best,
> Prashant Singh
>
>
> On Fri, Apr 18, 2025 at 8:56 AM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
> > Hi Prashant,
> >
> > Sorry for the delayed reply and apologies if I missed some relevant
> > discussion.
> >
> > As I understand the catalog could remove snapshots that come in-between
> > previous and current snapshots from the perspective of one of the
> clients.
> >
> > Can we be sure that the removed snapshot does not have material data
> > changes (e.g. new roes or updated rows) that should have been taken into
> > account by the client whose snapshot is forced to become "current". Could
> > this result in data loss?
> >
> > Thanks,
> > Dmitri.
> >
> > On 2025/03/31 22:44:03 Prashant Singh wrote:
> > > Hey folks,
> > >
> > > I wanted to propose this feature to Apache Polaris Rolling back
> > > replacements operation snapshots in the case during the concurrent
> write
> > > (compaction and other writers trying to commit to the table at the same
> > > time) to Iceberg there are conflicts. This is a feature which Ryan
> > proposed
> > > as an alternative when I was proposing a Priority Amongst Writer
> proposal
> > > [1]  in the Apache Iceberg community. This kind of makes the compaction
> > > always a low priority process.
> > >
> > > Earlier, I went ahead and added this feature as a client side change in
> > the
> > > Apache Iceberg repo [2] . It got some attraction but this didn't get to
> > the
> > > end. Now when we think more about it again Apache Polaris seems to be
> the
> > > best place to do it as it can benefit other language writer clients as
> > well
> > > and Polaris is the one to actually apply the commits based on the
> > > requirements and update sent by Iceberg Rest Client.
> > >
> > > Here is my draft PR [3] on how I think this can be achieved, given this
> > is
> > > enabled by a table property, happy to discuss other knobs for ex: maybe
> > > check the snapshot prop ?
> > >
> > > The logic essentially if we see is the base (B) on which the snapshot
> we
> > > want to include/commit is based on is changed to something like (B`)
> and
> > > the given snapshot from B` to B are all of ops type *REPLACE *. It adds
> > > other updates within the same update Table req
> > > 1. moved the snapshot ref to B
> > > 2. [Optional] to remove the snapshot between B` to B given its all of
> > > *REPLACE*.
> > > Then try the requirements and updates again on the updated base and see
> > if
> > > it succeeds. To make all this as part of one updateReq and then commit
> to
> > > the table.
> > > Doing it this way preserves the schema changes for which no new
> snapshot
> > > has been created, just a new metadata.json is created.
> > >
> > > Happy to know your thoughts on the same.
> > >
> > > Links:
> > > [1]
> > >
> >
> https://docs.google.com/document/d/1pSqxf5A59J062j9VFF5rcCpbW9vdTbBKTmjps80D-B0/edit?tab=t.0#heading=h.fn6jmpw6phpn
> > > [2] https://github.com/apache/iceberg/pull/5888
> > > [3] https://github.com/apache/polaris/pull/1285
> > >
> > > Best,
> > > Prashant Singh
> > >
> >
>

Reply via email to