Re: Partition recovery issue on partition loss.

Dmitry Pavlov Thu, 15 Mar 2018 07:29:48 -0700

Hi Andrew Mashenkov,

would you like to pick up this issue?


Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 6:23, Dmitriy Setrakyan <[email protected]> wrote:

> Completely agree, we must fix this. I like the proposed design. We should
> also specify that the resetLostPartitions() method should return true or
> false.
>
> Val, do you mind updating the ticket with the new design?
> https://issues.apache.org/jira/browse/IGNITE-7832
>
> D.
>
> On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
> [email protected]> wrote:
>
> > This indeed looks like a bigger issue. Basically, there is no clear way
> > (or no way at all) to synchronize the code that listens to the partition
> > loss event and the code that calls the resetLostPartitions() method.
> > Example scenario:
> >
> > 1. Cache is configured with 3rd party persistence.
> > 2. One or more nodes fail causing loss of several partitions in memory.
> > 3. Ignite blocks access to those partitions according to partition loss
> > policy and fires an event.
> > 4. Application listens to the event and starts reloading the data from
> > store.
> > 5. When reloading is complete, application calls resetLostPartitions() to
> > restore access.
> > 6. Nodes fail again causing another partition loss, new event is fired.
> >
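> > There is a race between steps 5 and 6. If the second failure happens
> > BEFORE resetLostPartitions() is called, the reset also clears the newly
> > lost partitions, which have not been reloaded yet, and we end up with
> > inconsistent data.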
> >
> > I believe the only way to fix this is to add the corresponding topology
> > version to the partition loss event, and also add it as a parameter for
> > resetLostPartitions(). This way, if resetLostPartitions() is invoked with
> > a version that is no longer the latest, the invocation will be ignored.
> >
> > The only problem with this approach is that the topology version itself
> > is currently not part of the public API. It needs to be properly exposed
> > there first.
> >
> > -Val
> >
> > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <[email protected]> wrote:
> >
> > > Just in case, here is where you can find the current documentation:
> > >
> > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> > >
> > > Let us know what needs to be updated once the issues you reported are
> > > addressed.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> > > [email protected]> wrote:
> > >
> > > > Hi Igniters,
> > > >
> > > > > I've found that we have no documentation on how a user can recover
> > > > > a cache from the cacheStore in case of partition loss.
> > > > > Ignite provides some instruments (methods and events) that should
> > > > > help the user solve this problem, but it looks like these
> > > > > instruments have an architectural flaw.
> > > >
> > > > > The first one is a usability issue. Ignite provides a partition
> > > > > loss event so that the user can handle this, but Ignite fires an
> > > > > event per partition.
> > > > > Why can't we have a single event with a list of lost partitions?
> > > >
> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
> > > > > doesn't care which topology version the recovered partitions
> > > > > belonged to.
> > > > > There is a race when the user calls this method after a node has
> > > > > failed, but right before Ignite fires the event.
> > > > > So it is possible that the state of just-lost partitions will be
> > > > > reset unexpectedly.
> > > >
> > > >
> > > > > I've created a ticket for this [1] and think we should rethink the
> > > > > architecture of the partition recovery mechanics and improve the
> > > > > documentation.
> > > > > Any thoughts?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey V. Mashenkov
> > > >
> > >
> >
>
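
To make the design proposed in the quoted thread easier to discuss, here is a
minimal sketch of a version-checked reset. It is a toy model in plain Java,
not Ignite code: the class VersionedResetSketch, the nodeFailed() helper and
the long-typed topology version are all hypothetical, and the real
resetLostPartitions() signature may end up looking different. The only point
it illustrates is that a reset carrying a stale topology version is ignored
and reports that by returning false.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (hypothetical, not the Ignite API) of a version-checked reset. */
final class VersionedResetSketch {
    /** Simulated topology version, bumped on every node failure. */
    private long topVer;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new TreeSet<>();

    /**
     * Simulates a node failure that loses the given partitions.
     * Returns the topology version that the loss event would carry.
     */
    synchronized long nodeFailed(Set<Integer> parts) {
        topVer++;
        lost.addAll(parts);
        return topVer;
    }

    /**
     * Resets lost partitions only if evtTopVer is still the latest version.
     * Returns true if the reset was applied, false if it was ignored.
     */
    synchronized boolean resetLostPartitions(long evtTopVer) {
        if (evtTopVer != topVer)
            return false; // A newer loss happened; the caller must reload again.

        lost.clear();
        return true;
    }

    /** Returns a snapshot of the partitions currently marked as lost. */
    synchronized Set<Integer> lostPartitions() {
        return Set.copyOf(lost);
    }

    public static void main(String[] args) {
        VersionedResetSketch cache = new VersionedResetSketch();

        long first = cache.nodeFailed(Set.of(1, 2)); // Steps 2-3: first loss.
        long second = cache.nodeFailed(Set.of(3));   // Step 6 happens early.

        // Step 5 arrives late with the old version and is ignored.
        System.out.println(cache.resetLostPartitions(first));  // prints "false"

        // After reloading again, a reset with the latest version succeeds.
        System.out.println(cache.resetLostPartitions(second)); // prints "true"
    }
}
```

With a check like this, the application can treat a false return value as a
signal to reload the data for the newer loss event before retrying the reset.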
