Re: Partition recovery issue on partition loss.

Dmitry Pavlov Tue, 27 Mar 2018 09:22:39 -0700

Denis, it seems no one is working on it.

Thu, Mar 22, 2018 at 21:26, Denis Magda <[email protected]>:


> Igniters,
>
> Is anybody working on this bug? There is a high chance we can add a fix to
> 2.5 if the community agrees to release it earlier.
>
> --
> Denis
>
> On Thu, Mar 15, 2018 at 11:04 AM, Denis Magda <[email protected]> wrote:
>
> > I dared to set fix version to 2.5 and increased the severity. It's
> > important to fix the race since we've just released the partition loss
> > functionality in 2.4 and it's already broken.
> >
> > Andrey, please keep us posted. If you can't fix it, we will need to find
> > another contributor.
> >
> > --
> > Denis
> >
> > On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <[email protected]>
> > wrote:
> >
> >> Hi Andrew Mashenkov,
> >>
> >> would you like to pick up this issue?
> >>
> >> Sincerely,
> >> Dmitriy Pavlov
> >>
> >> Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <[email protected]>:
> >>
> >> > Completely agree, we must fix this. I like the proposed design. We
> >> > should also specify that the resetLostPartitions() method should return
> >> > true or false.
> >> >
> >> > Val, do you mind updating the ticket with new design?
> >> > https://issues.apache.org/jira/browse/IGNITE-7832
> >> >
> >> > D.
> >> >
> >> > On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
> >> > [email protected]> wrote:
> >> >
> >> > > This indeed looks like a bigger issue. Basically, there is no clear
> >> > > way (or no way at all) to synchronize the code that listens to the
> >> > > partition loss event with the code that calls the
> >> > > resetLostPartitions() method. Example scenario:
> >> > >
> >> > > 1. Cache is configured with 3rd party persistence.
> >> > > 2. One or more nodes fail, causing loss of several partitions in
> >> > > memory.
> >> > > 3. Ignite blocks access to those partitions according to the
> >> > > partition loss policy and fires an event.
> >> > > 4. Application listens to the event and starts reloading the data
> >> > > from the store.
> >> > > 5. When reloading is complete, the application calls
> >> > > resetLostPartitions() to restore access.
> >> > > 6. Nodes fail again, causing another partition loss; a new event is
> >> > > fired.
> >> > >
> >> > > There is a race between steps 5 and 6. If the 2nd failure happens
> >> > > BEFORE resetLostPartitions() is called, we end up with inconsistent
> >> > > data.
> >> > >
> >> > > I believe the only way to fix this is to add the corresponding
> >> > > topology version to the partition loss event, and also add it as a
> >> > > parameter for resetLostPartitions(). This way, if
> >> > > resetLostPartitions() is invoked with a version that is no longer
> >> > > the latest, the invocation will be ignored.
> >> > >
> >> > > The only problem with this approach is that the topology version
> >> > > itself is currently not a part of the public API. It needs to be
> >> > > properly exposed there first.
> >> > >
> >> > > -Val
> >> > >
> >> > > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Just in case, here is where you can find the present documentation:
> >> > > >
> >> > > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> >> > > >
> >> > > > Let us know what needs to be updated once the issues you reported
> >> > > > are addressed.
> >> > > >
> >> > > > --
> >> > > > Denis
> >> > > >
> >> > > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> >> > > > [email protected]> wrote:
> >> > > >
> >> > > > > Hi Igniters,
> >> > > > >
> >> > > > > I've found that we have no documentation on how a user can
> >> > > > > recover a cache from a cacheStore in case of partition loss.
> >> > > > > Ignite provides some instruments (methods and events) that should
> >> > > > > help the user solve this problem, but it looks like these
> >> > > > > instruments have an architectural flaw.
> >> > > > >
> >> > > > > The first one is a usability issue. Ignite provides a partition
> >> > > > > loss event so that the user can handle this, but Ignite fires an
> >> > > > > event per partition.
> >> > > > > Why can't we have one event with a list of lost partitions?
> >> > > > >
> >> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
> >> > > > > doesn't care which topology version the recovered partitions
> >> > > > > belonged to.
> >> > > > > There is a race when the user calls this method after a node has
> >> > > > > failed, but right before Ignite fires an event.
> >> > > > > So it is possible that the state of just-lost partitions will be
> >> > > > > reset unexpectedly.
> >> > > > >
> >> > > > >
> >> > > > > I've created a ticket for this [1] and think we should rethink
> >> > > > > the architecture of the partition recovery mechanics and improve
> >> > > > > the documentation.
> >> > > > > Any thoughts?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > > Andrey V. Mashenkov
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
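
For anyone picking this up: the design discussed in the quoted thread (a topology version carried by the partition loss event and passed back to resetLostPartitions(), with a true/false result) could be sketched roughly as below. This is only a toy, single-JVM model and not Ignite code. The class VersionedResetSketch, the loseParts() helper, the topVer counter, and the resetLostPartitions(long) signature are all hypothetical names made up for illustration; the real change would depend on how the topology version gets exposed in the public API.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (NOT Ignite API): lost-partition state guarded by a topology version. */
class VersionedResetSketch {
    /** Hypothetical topology version; bumped on every simulated node failure. */
    private long topVer = 1;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new TreeSet<>();

    /** Simulates a failure that loses partitions; returns the version the "event" would carry. */
    synchronized long loseParts(int... parts) {
        topVer++;
        for (int p : parts)
            lost.add(p);
        return topVer;
    }

    /**
     * Hypothetical versioned reset. It is applied only if no further loss
     * happened after the event the caller reacted to; otherwise it is ignored.
     *
     * @return true if the lost state was reset, false if the call was stale.
     */
    synchronized boolean resetLostPartitions(long eventVer) {
        if (eventVer != topVer)
            return false; // A newer loss happened: the caller must reload again.

        lost.clear();
        return true;
    }

    /** Returns a snapshot of the partitions currently marked as lost. */
    synchronized Set<Integer> lostPartitions() {
        return new TreeSet<>(lost);
    }

    public static void main(String[] args) {
        VersionedResetSketch cache = new VersionedResetSketch();

        long v1 = cache.loseParts(1, 2); // First failure (steps 2-3 of the scenario).
        long v2 = cache.loseParts(3);    // Second failure during reload (step 6).

        // The reset issued for the first event is stale and must not clear partition 3.
        System.out.println("stale reset applied: " + cache.resetLostPartitions(v1));
        System.out.println("still lost: " + cache.lostPartitions());

        // After reloading for the latest event, the reset goes through.
        System.out.println("fresh reset applied: " + cache.resetLostPartitions(v2));
    }
}
```

With a boolean result the application can tell that its reset was ignored and restart the reload for the newer event, instead of silently re-opening partitions that were never reloaded.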
