On 2017-01-12 14:19, Stefan Eissing wrote:
If you are not into lease handling and cluster node recovery, this might not be
for you.
In a cloud based cluster app w. oak, we seem to encounter node VM lockups now
and then. We are in the process of drilling down what is causing this with the
cloud hoster. But whatever this is, nodes are experiencing it. So far, this
happened always to a single node at a time while the other cluster nodes ran
unaffected. However, when the freeze is just long enough, lease renewal fails
and oak shuts down.
This is rather painful. And before you say "well, tell the hoster to fix it.",
think of DoS or broken cables in the cloud, disks can be network based storage, etc.
Clocks can get out of sync too and jump ahead. The bigger the cluster, the higher the
likelihood.
2 Observations relevant (hopefully) for the people concerned with this part of
oak on this list:
A. the freeze duration of no return is the "leaseEndTime - failureMargin", so
20 seconds too early, due to a bug in the ClusterNodeInfo instantiation. I will file a
ticket once Apache Jira is working again.
...
FWIW, this code will only help in single-node scenarios anyway. If there
is another node running concurrently which can see the persistence,
it''ll declare that node as dead by updating the associated
ClusterNodeInfo, and subsequently running LastRevRecovery on it.
I do agree that there seems that the retry loop never will do the right
thing, and something needs to be fixed here.
Best regards, Julian
(see also OAK-5446 once Jira is up again)