Hi Andrew Mashenkov, would you like to pick up this issue?
Sincerely,
Dmitriy Pavlov

On Thu, Mar 15, 2018 at 6:23, Dmitriy Setrakyan <[email protected]> wrote:

> Completely agree, we must fix this. I like the proposed design. We should
> also specify that the resetLostPartitions() method should return true or
> false.
>
> Val, do you mind updating the ticket with the new design?
> https://issues.apache.org/jira/browse/IGNITE-7832
>
> D.
>
> On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <[email protected]> wrote:
>
> > This indeed looks like a bigger issue. Basically, there is no clear way
> > (or no way at all) to synchronize the code that listens to the partition
> > loss event with the code that calls the resetLostPartitions() method.
> > Example scenario:
> >
> > 1. Cache is configured with 3rd party persistence.
> > 2. One or more nodes fail, causing loss of several partitions in memory.
> > 3. Ignite blocks access to those partitions according to the partition
> >    loss policy and fires an event.
> > 4. Application listens to the event and starts reloading the data from
> >    the store.
> > 5. When reloading is complete, application calls resetLostPartitions() to
> >    restore access.
> > 6. Nodes fail again, causing another partition loss; a new event is fired.
> >
> > There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
> > resetLostPartitions() is called, we end up with inconsistent data.
> >
> > I believe the only way to fix this is to add the corresponding topology
> > version to the partition loss event, and also add it as a parameter for
> > resetLostPartitions(). This way, if resetLostPartitions() is invoked with
> > a version that is no longer the latest, the invocation will be ignored.
> >
> > The only problem with this approach is that the topology version itself is
> > currently not a part of the public API. It needs to be properly exposed
> > there first.
> >
> > -Val
> >
> > On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <[email protected]> wrote:
> >
> > > Just in case, here you can find the present documentation:
> > > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> > >
> > > Let us know what needs to be updated once the issues you reported are
> > > addressed.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <[email protected]> wrote:
> > >
> > > > Hi Igniters,
> > > >
> > > > I've found that we have no documentation on how a user can recover a
> > > > cache from the cacheStore in case of partition loss.
> > > > Ignite provides some instruments (methods and events) that should help
> > > > the user solve this problem, but it looks like these instruments have
> > > > architectural flaws.
> > > >
> > > > The first one is a usability issue. Ignite provides a partition loss
> > > > event so that the user can handle this, but Ignite fires an event per
> > > > partition. Why can't we have an event with a list of lost partitions?
> > > >
> > > > The second one is a bug. The Ignite.resetLostPartitions() method doesn't
> > > > care which topology version the recovered partitions belonged to.
> > > > There is a race when the user calls this method after a node has failed
> > > > but right before Ignite fires the event.
> > > > So it is possible that the state of just-lost partitions will be reset
> > > > unexpectedly.
> > > >
> > > > I've created a ticket for this [1] and think we should rethink the
> > > > architecture of the partition recovery mechanics and improve the
> > > > documentation.
> > > > Any thoughts?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey V. Mashenkov
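
For reference, the versioned reset that Val proposes in the thread can be sketched as a small in-memory model. This is not Ignite code and does not use any Ignite API: the class VersionedLossTracker, its methods, and the plain long standing in for a topology version are all hypothetical names chosen for illustration. The sketch only shows the intended behaviour, namely that a reset carrying an outdated version is ignored and reported as false.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical in-memory model of the design discussed in the thread.
 * NOT part of Ignite: every name here is an assumption made for illustration,
 * and a plain long stands in for the (not yet public) topology version.
 */
class VersionedLossTracker {
    /** Bumped on every simulated partition loss. */
    private long topVer;

    /** Partitions whose access is currently blocked. */
    private final Set<Integer> lost = new TreeSet<>();

    /**
     * Simulates a node failure: records all lost partitions at once and
     * returns the version that a single loss event would carry.
     */
    synchronized long onPartitionsLost(Collection<Integer> parts) {
        lost.addAll(parts);
        return ++topVer;
    }

    /**
     * Restores access only if no further loss happened since topVerSeen.
     *
     * @return true if the reset was applied, false if it was ignored as stale.
     */
    synchronized boolean resetLostPartitions(long topVerSeen) {
        if (topVerSeen != topVer)
            return false; // A newer loss exists; the caller must reload again.

        lost.clear();
        return true;
    }

    /** Returns a snapshot of the partitions that are still marked as lost. */
    synchronized Set<Integer> lostPartitions() {
        return new TreeSet<>(lost);
    }

    public static void main(String[] args) {
        VersionedLossTracker tracker = new VersionedLossTracker();

        // Steps 2-3 of the scenario: first failure, event fired with a version.
        long first = tracker.onPartitionsLost(Set.of(3, 7));

        // Step 4 would reload partitions 3 and 7 from the store here.

        // Step 6 happens before step 5: a second failure bumps the version.
        long second = tracker.onPartitionsLost(Set.of(11));

        // Step 5 with the outdated version is ignored instead of unblocking 11.
        System.out.println("reset(first)  applied: " + tracker.resetLostPartitions(first));
        System.out.println("still lost: " + tracker.lostPartitions());

        // After reloading partition 11 as well, a reset with the latest version succeeds.
        System.out.println("reset(second) applied: " + tracker.resetLostPartitions(second));
    }
}
```

In this model the boolean return value is what lets the application notice that its reset was rejected and that it has to handle the newer loss event first, which matches the suggestion that resetLostPartitions() should return true or false. Passing a collection to onPartitionsLost() also reflects the request for one event per loss rather than one event per partition.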
