[ https://issues.apache.org/jira/browse/OAK-2739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695237#comment-14695237 ]
Stefan Egli commented on OAK-2739:
----------------------------------
[~chetanm], [~mreutegg], re ideas for these integration tests: what I tried now
was forcing an instance to write changes after it stopped updating its lease
(so that others see it as inactive). That unfortunately (or fortunately,
depending on the point of view ;) did not trigger any problem, at least none
that I was able to observe so far. Perhaps you can think of a scenario that
would?
Other than that, what is known to cause problems is when 'discovery on oak'
reports an instance as inactive and upper-level code then makes assumptions
based on that, e.g. it could elect a new leader even though the old leader is
still active. I think this class of problems is outside the scope of oak itself.
In other words: I'm wondering what sort of tests we can do here that make
sense.
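To make the scenario concrete, here is a minimal, self-contained sketch of the asymmetry and of one possible reaction (rejecting writes once the own lease has run out). It is only an illustration under stated assumptions: `Lease`, `Instance` and the injected clock are hypothetical stand-ins invented for this example, not Oak classes, and the sketch says nothing about how ClusterNodeInfo actually behaves.

```java
import java.util.function.LongSupplier;

// Hypothetical toy model; none of these types exist in Oak.
class LeaseGuardSketch {

    /** Toy lease record that every cluster member can read. */
    static final class Lease {
        private final long durationMillis;
        private long leaseEnd;

        Lease(long now, long durationMillis) {
            this.durationMillis = durationMillis;
            this.leaseEnd = now + durationMillis;
        }

        /** Extends the lease; the owner is expected to call this periodically. */
        void renew(long now) {
            leaseEnd = now + durationMillis;
        }

        /** The check the other instances use to decide the owner is inactive. */
        boolean isExpired(long now) {
            return now >= leaseEnd;
        }
    }

    /** Toy instance that applies the same check to itself before each write. */
    static final class Instance {
        private final Lease lease;
        private final LongSupplier clock;
        private int writes;

        Instance(Lease lease, LongSupplier clock) {
            this.lease = lease;
            this.clock = clock;
        }

        void write() {
            // Assumed policy: refuse to write instead of continuing as 'active'.
            if (lease.isExpired(clock.getAsLong())) {
                throw new IllegalStateException("lease expired, refusing to write");
            }
            writes++;
        }

        int writes() {
            return writes;
        }
    }

    public static void main(String[] args) {
        long[] now = {0};                         // manually advanced test clock
        Lease lease = new Lease(now[0], 1000);
        Instance instance = new Instance(lease, () -> now[0]);

        instance.write();                         // lease still valid
        now[0] = 1500;                            // renewals stopped, lease ran out
        System.out.println("others see it as inactive: " + lease.isExpired(now[0]));
        try {
            instance.write();
        } catch (IllegalStateException e) {
            System.out.println("write rejected: " + e.getMessage());
        }
    }
}
```

Without the check in `write()`, the second write would go through although every other member already treats the instance as inactive, which is the situation an integration test would have to provoke.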
> take appropriate action when lease cannot be renewed (in time)
> --------------------------------------------------------------
>
> Key: OAK-2739
> URL: https://issues.apache.org/jira/browse/OAK-2739
> Project: Jackrabbit Oak
> Issue Type: Task
> Components: mongomk
> Affects Versions: 1.2
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Labels: resilience
> Fix For: 1.3.5
>
> Attachments: OAK-2739.patch
>
>
> Currently, in an oak-cluster when (e.g.) one oak-client stops renewing its
> lease (ClusterNodeInfo.renewLease()), this will eventually be noticed by the
> others in the same oak-cluster. Those then mark this client as {{inactive}}
> and start recovering it, subsequently removing that node from any further
> merge etc. operations.
> Now, whatever the reason was why that client stopped renewing the lease
> (could be an exception, deadlock, whatever) - that client itself still
> considers itself as {{active}} and continues to take part in the cluster
> action.
> This will result in an unbalanced situation where that one client 'sees'
> everybody as {{active}} while the others see this one as {{inactive}}.
> If this ClusterNodeInfo state should be something that can be built upon, and
> to avoid any inconsistency due to unbalanced handling, the inactive node
> should probably retire gracefully - or any other appropriate action should be
> taken, other than just continuing as today.
> This ticket is to keep track of ideas and actions taken wrt this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)