Igniters,

Is anybody working on this bug? There is a high chance we can add a fix to 2.5 if the community agrees to release it earlier.
--
Denis

On Thu, Mar 15, 2018 at 11:04 AM, Denis Magda <dma...@apache.org> wrote:

> I dared to set the fix version to 2.5 and increased the severity. It's
> important to fix the race since we've just released the partition loss
> functionality in 2.4 and it's already broken.
>
> Andrey, please keep us posted. If you don't fix it, we will need to find
> another contributor.
>
> --
> Denis
>
> On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <dpavlov....@gmail.com>
> wrote:
>
>> Hi Andrew Mashenkov,
>>
>> would you like to pick up the issue?
>>
>> Sincerely,
>> Dmitriy Pavlov
>>
>> Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <dsetrak...@apache.org>:
>>
>> > Completely agree, we must fix this. I like the proposed design. We
>> > should also specify that the resetLostPartitions() method should return
>> > true or false.
>> >
>> > Val, do you mind updating the ticket with the new design?
>> > https://issues.apache.org/jira/browse/IGNITE-7832
>> >
>> > D.
>> >
>> > On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
>> > valentin.kuliche...@gmail.com> wrote:
>> >
>> > > This indeed looks like a bigger issue. Basically, there is no clear
>> > > way (or no way at all) to synchronize the code that listens to the
>> > > partition loss event with the code that calls the
>> > > resetLostPartitions() method. Example scenario:
>> > >
>> > > 1. Cache is configured with 3rd party persistence.
>> > > 2. One or more nodes fail, causing loss of several partitions in
>> > > memory.
>> > > 3. Ignite blocks access to those partitions according to the
>> > > partition loss policy and fires an event.
>> > > 4. Application listens to the event and starts reloading the data
>> > > from the store.
>> > > 5. When reloading is complete, the application calls
>> > > resetLostPartitions() to restore access.
>> > > 6. Nodes fail again, causing another partition loss; a new event is
>> > > fired.
>> > >
>> > > There is a race between steps 5 and 6.
>> > > If the 2nd failure happens BEFORE resetLostPartitions() is called,
>> > > we end up with inconsistent data.
>> > >
>> > > I believe the only way to fix this is to add the corresponding
>> > > topology version to the partition loss event, and also add it as a
>> > > parameter for resetLostPartitions().
>> > > This way, if resetLostPartitions() is invoked with a version that is
>> > > not the latest anymore, the invocation will be ignored.
>> > >
>> > > The only problem with this approach is that the topology version
>> > > itself is currently not a part of the public API. It needs to be
>> > > properly exposed there first.
>> > >
>> > > -Val
>> > >
>> > > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dma...@apache.org>
>> > > wrote:
>> > >
>> > > > Just in case, here is where you can find the present documentation:
>> > > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
>> > > >
>> > > > Let us know what needs to be updated once the issues you reported
>> > > > are addressed.
>> > > >
>> > > > --
>> > > > Denis
>> > > >
>> > > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
>> > > > andrey.mashen...@gmail.com> wrote:
>> > > >
>> > > > > Hi Igniters,
>> > > > >
>> > > > > I've found we have no documentation on how a user can recover a
>> > > > > cache from the cacheStore in case of partition loss.
>> > > > > Ignite provides some instruments (methods and events) that should
>> > > > > help the user solve this problem, but it looks like these
>> > > > > instruments have an architectural flaw.
>> > > > >
>> > > > > The first one is a usability issue. Ignite provides a partition
>> > > > > loss event so the user can handle this, but Ignite fires an event
>> > > > > per partition.
>> > > > > Why can't we have one event with a list of lost partitions?
>> > > > >
>> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
>> > > > > doesn't care about what topology version the recovered partitions
>> > > > > belonged to.
>> > > > > There is a race when the user calls this method after a node has
>> > > > > failed but right before Ignite fires the event.
>> > > > > So it is possible that the state of just-lost partitions will be
>> > > > > reset unexpectedly.
>> > > > >
>> > > > > I've created a ticket for this [1] and think we should rethink the
>> > > > > architecture of the partition recovery mechanics and improve the
>> > > > > documentation.
>> > > > > Any thoughts?
>> > > > >
>> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
>> > > > >
>> > > > > --
>> > > > > Best regards,
>> > > > > Andrey V. Mashenkov
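
For reference, the design Val proposes in the thread (a topology version carried by the loss event and passed back to resetLostPartitions(), with stale calls ignored and a true/false result as Dmitriy suggests) could look roughly like the sketch below. This is a toy model and not Ignite code: the class LostPartitionsSketch, the method onPartitionsLost(), the long-typed version, and the versioned resetLostPartitions(long) overload are all hypothetical names made up for illustration. The real change is tracked in IGNITE-7832.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the proposed design. Hypothetical, not part of the Ignite API. */
final class LostPartitionsSketch {
    /** Stand-in for the topology version; bumped on every simulated node failure. */
    private long topVer;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new HashSet<>();

    /**
     * Simulates a node failure that loses the given partitions.
     * Returns the version that the (hypothetical) loss event would carry.
     */
    synchronized long onPartitionsLost(Collection<Integer> parts) {
        lost.addAll(parts);
        return ++topVer;
    }

    /**
     * Resets lost partitions only if the caller saw the latest loss event.
     * A stale version means another failure happened in between, so the call is ignored.
     */
    synchronized boolean resetLostPartitions(long evtTopVer) {
        if (evtTopVer != topVer)
            return false;

        lost.clear();
        return true;
    }

    /** Returns a snapshot of the partitions that are still marked as lost. */
    synchronized Set<Integer> lostPartitions() {
        return new HashSet<>(lost);
    }

    public static void main(String[] args) {
        LostPartitionsSketch cache = new LostPartitionsSketch();

        long v1 = cache.onPartitionsLost(Arrays.asList(1, 2)); // first failure (step 2)
        long v2 = cache.onPartitionsLost(Arrays.asList(3));    // second failure (step 6)

        System.out.println(cache.resetLostPartitions(v1)); // false: stale version, ignored
        System.out.println(cache.resetLostPartitions(v2)); // true: latest version, reset
    }
}
```

With a version check like this, the application from the example scenario would learn from the false result that a second loss happened while it was reloading, and could reload again instead of restoring access to partitions with missing data.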