Re: Partition recovery issue on partition loss.

Dmitriy Setrakyan Wed, 14 Mar 2018 20:23:31 -0700

Completely agree, we must fix this. I like the proposed design. We should
also specify that the resetLostPartitions() method should return true or false.
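
To make the proposed design concrete, here is a minimal sketch in plain Java. It is not the Ignite API: the class VersionedResetSketch, the methods onPartitionsLost() and lostPartitions(), and the long counter standing in for the topology version are all hypothetical. It also assumes that "true" means the reset was applied and "false" means it was ignored because the supplied version is no longer the latest.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a version-guarded reset; NOT the Ignite API. */
public class VersionedResetSketch {
    /** Stand-in for the topology version (assumption: a simple counter). */
    private long topVer;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new HashSet<>();

    /**
     * Simulates a node failure: marks partitions as lost, bumps the version,
     * and returns the version that the loss event would carry.
     */
    public synchronized long onPartitionsLost(Collection<Integer> parts) {
        lost.addAll(parts);
        return ++topVer;
    }

    /**
     * Resets lost partitions only if the caller saw the latest version.
     * Returns true if the reset was applied, false if it was ignored.
     */
    public synchronized boolean resetLostPartitions(long eventTopVer) {
        if (eventTopVer != topVer)
            return false; // Stale: another loss happened after this event.

        lost.clear();
        return true;
    }

    /** Returns a snapshot of the partitions still marked as lost. */
    public synchronized Set<Integer> lostPartitions() {
        return new HashSet<>(lost);
    }

    public static void main(String[] args) {
        VersionedResetSketch cache = new VersionedResetSketch();

        // First failure: the application starts reloading partitions 1 and 2.
        long v1 = cache.onPartitionsLost(Arrays.asList(1, 2));

        // Second failure happens before the application calls reset.
        long v2 = cache.onPartitionsLost(Arrays.asList(7));

        System.out.println(cache.resetLostPartitions(v1)); // false (stale)
        System.out.println(cache.resetLostPartitions(v2)); // true
    }
}
```

With a boolean result, the application can tell that its reset was ignored and wait for the handling of the newer loss event instead of assuming access was restored.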


Val, do you mind updating the ticket with the new design?
https://issues.apache.org/jira/browse/IGNITE-7832

D.

On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
[email protected]> wrote:

> This indeed looks like a bigger issue. Basically, there is no clear way (or
> no way at all) to synchronize the code that listens to the partition loss
> event with the code that calls the resetLostPartitions() method. Example scenario:
>
> 1. Cache is configured with 3rd party persistence.
> 2. One or more nodes fail causing loss of several partitions in memory.
> 3. Ignite blocks access to those partitions according to partition loss
> policy and fires an event.
> 4. Application listens to the event and starts reloading the data from
> store.
> 5. When reloading is complete, application calls resetLostPartitions() to
> restore access.
> 6. Nodes fail again causing another partition loss, new event is fired.
>
> There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
> resetLostPartitions() is called, we end up with inconsistent data.
>
> I believe the only way to fix this is to add the corresponding topology
> version to the partition loss event, and also add it as a parameter for
> resetLostPartitions().
> This way, if resetLostPartitions() is invoked with a version that is no
> longer the latest, the invocation will be ignored.
>
> The only problem with this approach is that the topology version itself is
> currently not a part of the public API. It needs to be properly exposed there
> first.
>
> -Val
>
> On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <[email protected]> wrote:
>
> > Just in case, here is where you can find the current documentation:
> > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> >
> > Let us know what needs to be updated once the issues reported by you are
> > addressed.
> >
> > --
> > Denis
> >
> > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> > [email protected]> wrote:
> >
> > > Hi Igniters,
> > >
> > > I've found that we have no documentation on how a user can recover a
> > > cache from a cacheStore in case of partition loss.
> > > Ignite provides some instruments (methods and events) that should help
> > > the user solve this problem,
> > > but it looks like these instruments have an architectural flaw.
> > >
> > > The first one is a usability issue. Ignite provides a partition loss
> > > event so that the user can handle this, but Ignite fires an event per
> > > partition.
> > > Why can't we have a single event with a list of lost partitions?
> > >
> > > The second one is a bug. The Ignite.resetLostPartitions() method doesn't
> > > care which topology version the recovered partitions belonged to.
> > > There is a race when the user calls this method after a node has failed
> > > but right before Ignite fires an event.
> > > So it is possible that the state of just-lost partitions will be reset
> > > unexpectedly.
> > >
> > >
> > > I've created a ticket for this [1] and think we should rethink the
> > > architecture of the partition recovery mechanics and improve the
> > > documentation.
> > > Any thoughts?
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> > >
> >
>
