Completely agree, we must fix this. I like the proposed design. We should also specify that the resetLostPartitions() method should return true or false.
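
To make the discussion concrete, here is a rough sketch of the contract as I read it. This is a toy model, not the Ignite API: the class LostPartitionTracker, the method onPartitionsLost(), and the use of a plain long as the "topology version" are all assumptions made for illustration. I am also assuming that true means "the reset was applied" and false means "the call was ignored because the version is stale".

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model of the proposed contract. NOT the Ignite API: all names here
 * are hypothetical, and a plain long stands in for the topology version.
 */
class LostPartitionTracker {
    /** Stand-in for the topology version; in this toy it only changes on loss. */
    private long topVer;

    /** Partitions currently marked as lost. */
    private final Set<Integer> lost = new HashSet<>();

    /**
     * Simulates a node failure that loses the given partitions.
     *
     * @return The version that the partition loss event would carry.
     */
    synchronized long onPartitionsLost(Collection<Integer> parts) {
        lost.addAll(parts);
        return ++topVer;
    }

    /**
     * Restores access only if the caller saw the latest loss event.
     *
     * @param eventTopVer Version taken from the event the caller reacted to.
     * @return true if the reset was applied, false if it was ignored as stale.
     */
    synchronized boolean resetLostPartitions(long eventTopVer) {
        if (eventTopVer != topVer)
            return false; // A newer loss happened; the reloaded data may be incomplete.

        lost.clear();
        return true;
    }

    /** @return A copy of the currently lost partitions. */
    synchronized Set<Integer> lostPartitions() {
        return new HashSet<>(lost);
    }
}
```

With something like this, an application that reloaded data after the first event and then calls resetLostPartitions() with that event's version gets false back if a second failure happened in between, and knows it has to reload again.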

Val, do you mind updating the ticket with the new design?
https://issues.apache.org/jira/browse/IGNITE-7832

D.

On Tue, Mar 13, 2018 at 5:31 PM, Valentin Kulichenko <
valentin.kuliche...@gmail.com> wrote:

> This indeed looks like a bigger issue. Basically, there is no clear way (or
> no way at all) to synchronize the code that listens to the partition loss
> event with the code that calls the resetLostPartitions() method. Example
> scenario:
>
> 1. Cache is configured with 3rd party persistence.
> 2. One or more nodes fail, causing the loss of several partitions in memory.
> 3. Ignite blocks access to those partitions according to the partition loss
> policy and fires an event.
> 4. Application listens to the event and starts reloading the data from the
> store.
> 5. When reloading is complete, the application calls resetLostPartitions()
> to restore access.
> 6. Nodes fail again, causing another partition loss; a new event is fired.
>
> There is a race between steps 5 and 6. If the 2nd failure happens BEFORE
> resetLostPartitions() is called, we end up with inconsistent data.
>
> I believe the only way to fix this is to add the corresponding topology
> version to the partition loss event, and also add it as a parameter for
> resetLostPartitions().
> This way, if resetLostPartitions() is invoked with a version that is no
> longer the latest, the invocation will be ignored.
>
> The only problem with this approach is that the topology version itself is
> currently not a part of the public API. It needs to be properly exposed
> there first.
>
> -Val
>
> On Mon, Mar 12, 2018 at 1:07 PM, Denis Magda <dma...@apache.org> wrote:
>
> > Just in case, here is where you can find the present documentation:
> > https://apacheignite.readme.io/docs/cache-modes#partition-loss-policies
> >
> > Let us know what needs to be updated once the issues you reported are
> > addressed.
> >
> > --
> > Denis
> >
> > On Mon, Mar 12, 2018 at 3:33 AM, Andrey Mashenkov <
> > andrey.mashen...@gmail.com> wrote:
> >
> > > Hi Igniters,
> > >
> > > I've found we have no documentation on how a user can recover a cache
> > > from a cacheStore in case of partition loss.
> > > Ignite provides some instruments (methods and events) that should help
> > > the user solve this problem,
> > > but it looks like these instruments have an architectural flaw.
> > >
> > > The first one is a usability issue. Ignite provides a partition loss
> > > event so that the user can handle this, but Ignite fires an event per
> > > partition.
> > > Why can't we have a single event with a list of lost partitions?
> > >
> > > The second one is a bug. The Ignite.resetLostPartitions() method doesn't
> > > care about which topology version the recovered partitions belonged to.
> > > There is a race when the user calls this method after a node has failed
> > > but right before Ignite fires an event.
> > > So it is possible that the state of just-lost partitions will be reset
> > > unexpectedly.
> > >
> > >
> > > I've created a ticket for this [1] and think we should rethink the
> > > architecture of the partition recovery mechanics and improve the
> > > documentation.
> > > Any thoughts?
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-7832
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov