Re: oak lease handling and clusters recovery

Julian Reschke Thu, 12 Jan 2017 07:58:37 -0800

On 2017-01-12 14:19, Stefan Eissing wrote:

If you are not into lease handling and cluster node recovery, this might not be 
for you.



In a cloud-based cluster app with oak, we seem to encounter node VM lockups now 
and then. We are in the process of drilling down with the cloud hoster into what 
is causing this. But whatever it is, nodes are experiencing it. So far, this has 
always happened to a single node at a time while the other cluster nodes ran 
unaffected. However, when the freeze is just long enough, lease renewal fails 
and oak shuts down.

This is rather painful. And before you say "well, tell the hoster to fix it", 
think of DoS or broken cables in the cloud; disks can be network-based storage, 
etc. Clocks can get out of sync too and jump ahead. The bigger the cluster, the 
higher the likelihood.

Two observations, hopefully relevant for the people concerned with this part of 
oak on this list:

  A. the freeze duration of no return is the "leaseEndTime - failureMargin", so 
20 seconds too early, due to a bug in the ClusterNodeInfo instantiation. I will file a 
ticket once Apache Jira is working again.
...

FWIW, this code will only help in single-node scenarios anyway. If there is 
another node running concurrently which can see the persistence, it'll declare 
that node as dead by updating the associated ClusterNodeInfo, and subsequently 
running LastRevRecovery on it.
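
To make the timing concrete, here is a minimal Java sketch of the two views of 
a lease discussed above: the frozen node's own cut-off at "leaseEndTime - 
failureMargin", and a peer's decision to treat the node as dead. This is not 
Oak's actual code. The class and method names are hypothetical, the 120 second 
lease duration in main is an assumed value, and the peer's rule (lease expired 
once leaseEndTime has passed) is a simplifying assumption. Only the 20 second 
margin comes from the message quoted above.

```java
// Hypothetical sketch; names, values and rules are illustrative, not Oak's API.
public final class LeaseSketch {

    // The 20 second margin is taken from the quoted message.
    public static final long FAILURE_MARGIN_MS = 20_000L;

    private LeaseSketch() {
    }

    /** Local view: the node gives up once its clock reaches leaseEndTime - failureMargin. */
    public static boolean localNodeMustStop(long nowMs, long leaseEndTimeMs) {
        return nowMs >= leaseEndTimeMs - FAILURE_MARGIN_MS;
    }

    /** Peer view (assumed rule): another node may start recovery once leaseEndTime has passed. */
    public static boolean peerMayRecover(long nowMs, long leaseEndTimeMs) {
        return nowMs > leaseEndTimeMs;
    }

    /** Longest freeze, starting right at the last renewal, that stays below the local cut-off. */
    public static long maxSurvivableFreezeMs(long lastRenewalMs, long leaseEndTimeMs) {
        return (leaseEndTimeMs - FAILURE_MARGIN_MS) - lastRenewalMs;
    }

    public static void main(String[] args) {
        long lastRenewal = 0L;
        long leaseEnd = lastRenewal + 120_000L; // assumed lease duration of 120 s
        System.out.println("max survivable freeze (ms): "
                + maxSurvivableFreezeMs(lastRenewal, leaseEnd));
        // A node waking up 110 s after renewal is past its own cut-off,
        // although under the assumed rule a peer would not yet recover it.
        System.out.println("local must stop at 110 s: " + localNodeMustStop(110_000L, leaseEnd));
        System.out.println("peer may recover at 110 s: " + peerMayRecover(110_000L, leaseEnd));
    }
}
```

Under these assumptions there is a window of failureMargin length in which the 
local node has already given up while no peer would have declared it dead yet. 
Once a peer is past leaseEndTime, a local retry cannot help, because recovery 
may already be under way.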

I do agree that the retry loop seems like it will never do the right thing, and 
something needs to be fixed here.


Best regards, Julian

(see also OAK-5446 once Jira is up again)

