If you are not into lease handling and cluster node recovery, this might not be for you.
In a cloud-based cluster app with Oak, we seem to encounter node VM lockups now and then. We are in the process of drilling down into what is causing this with the cloud hoster. But whatever this is, nodes are experiencing it. So far, this has always happened to a single node at a time while the other cluster nodes ran unaffected. However, when the freeze is just long enough, lease renewal fails and Oak shuts down. This is rather painful.

And before you say "well, tell the hoster to fix it", think of DoS or broken cables in the cloud, disks that can be network-based storage, etc. Clocks can get out of sync too and jump ahead. The bigger the cluster, the higher the likelihood.

Two observations, hopefully relevant for the people on this list concerned with this part of Oak:

A. The point of no return for a freeze is "leaseEndTime - failureMargin", so 20 seconds too early, due to a bug in the ClusterNodeInfo instantiation. I will file a ticket once Apache Jira is working again.

B. If A gets fixed, a renewal in the first 4 seconds of other activity will be successful, *no matter the duration of the freeze* (my reading of the code, unless the cluster node data in the store was changed in the meantime...)

So, my question to the experts is: will B work? Should it? What are the risks? Local index copies? Can this be mitigated?

I'm afraid such scenarios are moving from purely academic to realistic in cloud/VM hosting.

Cheers,

Stefan Eissing

<green/>bytes GmbH
Hafenstrasse 16
48155 Münster
www.greenbytes.de
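
PS: To make the arithmetic in observation A concrete, here is a small sketch. It is not Oak code: the class and method names are made up for illustration, and the 120 second lease end time is only an assumed example value. Only the 20 second margin comes from the observation itself, and the "expected" limit reflects my reading that the margin should not be subtracted.

```java
/** Hypothetical illustration only; this is NOT Oak's ClusterNodeInfo code. */
final class LeaseWindowSketch {

    /** Point of no return as observed: the failure margin is subtracted from the lease end. */
    static long observedPointOfNoReturn(long leaseEndTimeMs, long failureMarginMs) {
        return leaseEndTimeMs - failureMarginMs;
    }

    /** Point of no return if the margin were not subtracted (assumed intent of a fix for A). */
    static long expectedPointOfNoReturn(long leaseEndTimeMs) {
        return leaseEndTimeMs;
    }

    /** A frozen node can still renew only if it wakes up before the point of no return. */
    static boolean survivesFreeze(long wakeUpMs, long pointOfNoReturnMs) {
        return wakeUpMs < pointOfNoReturnMs;
    }

    public static void main(String[] args) {
        long leaseEndTimeMs = 120_000L;  // assumed example value, not taken from Oak
        long failureMarginMs = 20_000L;  // the 20 seconds mentioned in observation A
        long wakeUpMs = 110_000L;        // node unfreezes 10 seconds before the lease ends

        long observed = observedPointOfNoReturn(leaseEndTimeMs, failureMarginMs);
        long expected = expectedPointOfNoReturn(leaseEndTimeMs);

        System.out.println("observed limit: " + observed
                + " ms, survives: " + survivesFreeze(wakeUpMs, observed));
        System.out.println("expected limit: " + expected
                + " ms, survives: " + survivesFreeze(wakeUpMs, expected));
    }
}
```

With these numbers, a node that wakes up 10 seconds before its lease ends is already lost under the observed limit, but would still be able to renew under the expected one.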
