Re: Partition recovery issue on partition loss.

Valentin Kulichenko Tue, 13 Mar 2018 17:33:09 -0700

This indeed looks like a bigger issue. Basically, there is no clear way (or
no way at all) to synchronize the code that listens for the partition loss
event with the code that calls resetLostPartitions(). Example scenario:

1. Cache is configured with 3rd party persistence.
2. One or more nodes fail causing loss of several partitions in memory.
3. Ignite blocks access to those partitions according to partition loss
policy and fires an event.
4. Application listens to the event and starts reloading the data from
store.
5. When reloading is complete, application calls resetLostPartitions() to
restore access.
6. Nodes fail again causing another partition loss, new event is fired.

There is a race between steps 5 and 6. If the second failure happens BEFORE
resetLostPartitions() is called, the reset also unblocks the newly lost
partitions, which were never reloaded, and we end up with inconsistent data.
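
To make the interleaving concrete, here is a minimal single-threaded sketch of
the scenario. It is only a model: UnguardedResetModel and all of its methods
are hypothetical names invented for illustration and are not Ignite code. It
assumes that a reset without a version check unblocks every partition that is
currently marked as lost.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the scenario; none of this is Ignite code.
class UnguardedResetModel {
    private final Set<Integer> lost = new HashSet<>();  // partitions with blocked access
    private final Set<Integer> stale = new HashSet<>(); // partitions whose data is missing

    // Steps 2-3 and 6: a failure loses partitions and access to them is blocked.
    void onNodeFailure(Integer... parts) {
        lost.addAll(Arrays.asList(parts));
        stale.addAll(Arrays.asList(parts));
    }

    // Step 4: the application reloads the partitions it was notified about.
    void reloadFromStore(Integer... parts) {
        stale.removeAll(Arrays.asList(parts));
    }

    // Step 5: a reset with no version check unblocks everything currently lost.
    void resetLostPartitions() {
        lost.clear();
    }

    // Partitions that are readable again although their data was never reloaded.
    Set<Integer> readableButStale() {
        Set<Integer> res = new HashSet<>(stale);
        res.removeAll(lost);
        return res;
    }

    public static void main(String[] args) {
        UnguardedResetModel cache = new UnguardedResetModel();
        cache.onNodeFailure(1, 2);    // first failure
        cache.reloadFromStore(1, 2);  // application reacts to the first event
        cache.onNodeFailure(3);       // second failure lands before the reset
        cache.resetLostPartitions();  // also unblocks partition 3
        System.out.println(cache.readableButStale()); // prints [3]
    }
}
```

In this model partition 3 becomes readable without ever being reloaded, which
is the inconsistency described above.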

I believe the only way to fix this is to add the corresponding topology
version to the partition loss event, and also to add it as a parameter of
resetLostPartitions(). This way, if resetLostPartitions() is invoked with a
version that is no longer the latest, the invocation will be ignored.
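
A rough sketch of how such a version check could behave is below. This is not
the Ignite API: VersionedLossTracker, onPartitionLoss() and the long-valued
version are hypothetical stand-ins, and a plain long is used only because the
real topology version type is not exposed publicly yet.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a version-guarded reset; not Ignite code.
class VersionedLossTracker {
    private long latestLossVer = -1;                   // version of the most recent loss
    private final Set<Integer> lost = new HashSet<>(); // partitions with blocked access

    // Records a loss and returns the version the loss event would carry.
    synchronized long onPartitionLoss(long topVer, Integer... parts) {
        latestLossVer = topVer;
        lost.addAll(Arrays.asList(parts));
        return topVer;
    }

    // Ignores the call (returns false) if a newer loss has happened since
    // the event the caller reacted to.
    synchronized boolean resetLostPartitions(long eventTopVer) {
        if (eventTopVer != latestLossVer)
            return false;
        lost.clear();
        return true;
    }

    // Defensive copy of the partitions that are still blocked.
    synchronized Set<Integer> lostPartitions() {
        return new HashSet<>(lost);
    }

    public static void main(String[] args) {
        VersionedLossTracker tracker = new VersionedLossTracker();
        long v1 = tracker.onPartitionLoss(5, 1, 2); // first failure
        long v2 = tracker.onPartitionLoss(6, 3);    // second failure before the reset
        System.out.println(tracker.resetLostPartitions(v1)); // prints false (stale version)
        System.out.println(tracker.resetLostPartitions(v2)); // prints true
    }
}
```

With a guard like this, the stale call from step 5 leaves all lost partitions
blocked. The application would then have to reload everything lost as of the
latest event before retrying with the newer version.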

The only problem with this approach is that the topology version itself is
currently not part of the public API. It needs to be properly exposed there
first.

-Val

On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <[email protected]> wrote:

> Just in case, here is where you can find the current documentation:
> https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
>
> Let us know what needs to be updated once the issues reported by you are
> addressed.
>
> --
> Denis
>
> On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> [email protected]> wrote:
>
> > Hi Igniters,
> >
> > I've found that we have no documentation on how a user can recover a
> > cache from a cacheStore in case of partition loss.
> > Ignite provides some instruments (methods and events) that should help
> > the user solve this problem, but it looks like these instruments have
> > an architectural flaw.
> >
> > The first one is a usability issue. Ignite provides a partition loss
> > event so that the user can handle this, but it fires one event per
> > partition. Why can't we have a single event with a list of lost
> > partitions?
> >
> > The second one is a bug. The Ignite.resetLostPartitions() method doesn't
> > care which topology version the recovered partitions belonged to.
> > There is a race when the user calls this method after a node has failed
> > but right before Ignite fires the event.
> > So it is possible that the state of the just-lost partitions will be
> > reset unexpectedly.
> >
> >
> > I've created a ticket for this [1] and think we should rethink the
> > architecture of the partition recovery mechanics and improve the
> > documentation.
> > Any thoughts?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
>
