Denis, it seems no one is working on it.

On Thu, Mar 22, 2018 at 21:26, Denis Magda <dma...@apache.org> wrote:
> Igniters,
>
> Is anybody working on this bug? There is a high chance we can add a fix to
> 2.5 if the community agrees to release it earlier.
>
> --
> Denis
>
> On Thu, Mar 15, 2018 at 11:04 AM, Denis Magda <dma...@apache.org> wrote:
>
> > I dared to set the fix version to 2.5 and increased the severity. It's
> > important to fix the race since we've just released the partition loss
> > functionality in 2.4 and it's already broken.
> >
> > Andrey, please keep us posted. If you are not going to fix it, we will
> > need to find another contributor.
> >
> > --
> > Denis
> >
> > On Thu, Mar 15, 2018 at 7:29 AM, Dmitry Pavlov <dpavlov....@gmail.com>
> > wrote:
> >
> >> Hi Andrew Mashenkov,
> >>
> >> would you like to pick up the issue?
> >>
> >> Sincerely,
> >> Dmitriy Pavlov
> >>
> >> On Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <dsetrak...@apache.org>
> >> wrote:
> >>
> >> > Completely agree, we must fix this. I like the proposed design. We
> >> > should also specify that the resetLostPartitions() method should
> >> > return true or false.
> >> >
> >> > Val, do you mind updating the ticket with the new design?
> >> > https://issues.apache.org/jira/browse/IGNITE-7832
> >> >
> >> > D.
> >> >
> >> > On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
> >> > valentin.kuliche...@gmail.com> wrote:
> >> >
> >> > > This indeed looks like a bigger issue. Basically, there is no clear
> >> > > way (or no way at all) to synchronize the code that listens to the
> >> > > partition loss event with the code that calls the
> >> > > resetLostPartitions() method. Example scenario:
> >> > >
> >> > > 1. Cache is configured with 3rd party persistence.
> >> > > 2. One or more nodes fail, causing loss of several partitions in
> >> > >    memory.
> >> > > 3. Ignite blocks access to those partitions according to the
> >> > >    partition loss policy and fires an event.
> >> > > 4. Application listens to the event and starts reloading the data
> >> > >    from the store.
> >> > > 5. When reloading is complete, application calls
> >> > >    resetLostPartitions() to restore access.
> >> > > 6. Nodes fail again, causing another partition loss; a new event is
> >> > >    fired.
> >> > >
> >> > > There is a race between steps 5 and 6. If the 2nd failure happens
> >> > > BEFORE resetLostPartitions() is called, we end up with inconsistent
> >> > > data.
> >> > >
> >> > > I believe the only way to fix this is to add the corresponding
> >> > > topology version to the partition loss event, and also add it as a
> >> > > parameter for resetLostPartitions().
> >> > > This way, if resetLostPartitions() is invoked with a version that is
> >> > > no longer the latest, the invocation will be ignored.
> >> > >
> >> > > The only problem with this approach is that the topology version
> >> > > itself is currently not a part of the public API. It needs to be
> >> > > properly exposed there first.
> >> > >
> >> > > -Val
> >> > >
> >> > > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dma...@apache.org>
> >> > > wrote:
> >> > >
> >> > > > Just in case, here is the present documentation:
> >> > > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> >> > > >
> >> > > > Let us know what needs to be updated once the issues you reported
> >> > > > are addressed.
> >> > > >
> >> > > > --
> >> > > > Denis
> >> > > >
> >> > > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> >> > > > andrey.mashen...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi Igniters,
> >> > > > >
> >> > > > > I've found we have no documentation on how a user can recover a
> >> > > > > cache from the cacheStore in case of partition loss.
> >> > > > > Ignite provides some instruments (methods and events) that
> >> > > > > should help the user solve this problem, but it looks like
> >> > > > > these instruments have an architectural flaw.
> >> > > > >
> >> > > > > The first one is a usability issue. Ignite provides a partition
> >> > > > > loss event so the user can handle this, but Ignite fires one
> >> > > > > event per partition.
> >> > > > > Why can't we have a single event with a list of lost partitions?
> >> > > > >
> >> > > > > The second one is a bug. The Ignite.resetLostPartitions() method
> >> > > > > doesn't care which topology version the recovered partitions
> >> > > > > belonged to.
> >> > > > > There is a race when the user calls this method after a node has
> >> > > > > failed but right before Ignite fires the event.
> >> > > > > So it is possible that the state of the just-lost partitions
> >> > > > > will be reset unexpectedly.
> >> > > > >
> >> > > > > I've created a ticket for this [1] and think we should rethink
> >> > > > > the architecture of the partition recovery mechanics and improve
> >> > > > > the documentation.
> >> > > > > Any thoughts?
> >> > > > >
> >> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > > Andrey V. Mashenkov
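
For reference, the version check proposed in Val's message might look roughly like the sketch below. This is only a standalone model of the idea, not Ignite code: the class LostPartitionGuard, its methods, and the use of a plain long as the topology version are all assumptions made for illustration. The thread itself notes that the topology version is not yet part of the public API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of a version-guarded reset; none of these names are Ignite API. */
final class LostPartitionGuard {
    /** Topology version of the most recent partition loss (assumed to be a plain long here). */
    private long lossVersion;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new HashSet<>();

    /** Records a loss and returns the version that would be attached to the loss event. */
    synchronized long onPartitionsLost(long topologyVersion, Set<Integer> partitions) {
        lossVersion = Math.max(lossVersion, topologyVersion);
        lost.addAll(partitions);
        return lossVersion;
    }

    /**
     * Clears the lost state only if the caller has seen the latest loss.
     * Returns true if the reset was applied, false if it was ignored as stale.
     */
    synchronized boolean resetLostPartitions(long seenVersion) {
        if (seenVersion != lossVersion)
            return false; // A newer loss happened after the caller started reloading.

        lost.clear();
        return true;
    }

    /** Returns a copy of the currently lost partitions. */
    synchronized Set<Integer> lostPartitions() {
        return new HashSet<>(lost);
    }

    public static void main(String[] args) {
        LostPartitionGuard guard = new LostPartitionGuard();

        // Steps 2-3: the first failure loses partitions 1 and 2.
        long first = guard.onPartitionsLost(5L, new HashSet<>(Arrays.asList(1, 2)));

        // Step 6 happens before step 5: a second failure loses partition 3.
        long second = guard.onPartitionsLost(6L, new HashSet<>(Arrays.asList(3)));

        // Step 5 with the stale version is ignored, so partition 3 stays blocked.
        System.out.println(guard.resetLostPartitions(first));  // prints "false"

        // After reloading for the second event, the reset goes through.
        System.out.println(guard.resetLostPartitions(second)); // prints "true"
    }
}
```

Returning a boolean from the reset follows Dmitriy's suggestion that the method should report whether it took effect, so the application knows it has to reload again for the newer event.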