If you are not into lease handling and cluster node recovery, this might not be 
for you.


In a cloud-based cluster app with Oak, we seem to encounter node VM lockups now 
and then. We are in the process of drilling down into what is causing this with 
the cloud hoster. But whatever it is, nodes are experiencing it. So far, this 
has always happened to a single node at a time while the other cluster nodes 
ran unaffected. However, when the freeze is just long enough, lease renewal 
fails and Oak shuts down. 

This is rather painful. And before you say "well, tell the hoster to fix it", 
think of DoS or broken cables in the cloud; disks can be network-based storage, 
etc. Clocks can get out of sync too and jump ahead. The bigger the cluster, the 
higher the likelihood.

Two observations, hopefully relevant for the people on this list concerned 
with this part of Oak:

  A. the freeze duration of no return is the "leaseEndTime - failureMargin", so 
20 seconds too early, due to a bug in the ClusterNodeInfo instantiation. I will 
file a ticket once Apache Jira is working again.
  B. If A gets fixed, a renewal in the first 4 seconds of other activity will 
be successful, *no matter the duration of the freeze* (my reading of the code, 
unless the cluster node data in the store was changed in the meantime...) 
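
To make the arithmetic in A concrete, here is a minimal sketch of how I read 
it. The class and method names are made up for illustration and are not Oak's 
actual code; the 120 second lease duration is an assumption for the example, 
only the 20 second margin comes from observation A.

```java
// Illustrative sketch only: names are hypothetical, not Oak's ClusterNodeInfo API.
class LeaseSketch {
    // Assumed lease length for the example; not taken from Oak's configuration.
    static final long LEASE_DURATION_MS = 120_000;
    // The 20 seconds mentioned in observation A.
    static final long FAILURE_MARGIN_MS = 20_000;

    // Deadline as described in A: the margin is subtracted from the lease end.
    static long deadlineWithMargin(long leaseEndTime) {
        return leaseEndTime - FAILURE_MARGIN_MS;
    }

    // A node that wakes up before the deadline can still renew its lease.
    static boolean canStillRenew(long wakeUpTime, long deadline) {
        return wakeUpTime < deadline;
    }

    public static void main(String[] args) {
        long leaseEndTime = LEASE_DURATION_MS;    // lease granted at t = 0
        long wakeUpTime = leaseEndTime - 10_000;  // freeze ends 10 s before lease end

        System.out.println("with margin:    "
                + canStillRenew(wakeUpTime, deadlineWithMargin(leaseEndTime)));
        System.out.println("without margin: "
                + canStillRenew(wakeUpTime, leaseEndTime));
    }
}
```

In this sketch, a node that wakes up 10 seconds before the lease end is already 
past the deadline when the margin is subtracted, but would still be in time if 
the deadline were the lease end itself.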

So, my question to the experts is: will B work? Should it? What are the risks? 
Local index copies? Can this be mitigated?

I'm afraid such scenarios are moving from purely academic to realistic in 
cloud/VM hosting.

Cheers,

Stefan Eissing

<green/>bytes GmbH
Hafenstrasse 16
48155 Münster
www.greenbytes.de
