Handling transient provisioning failures

Aled Sage Wed, 19 Nov 2014 15:55:00 -0800

Hi all,

I've spent today doing a lot of QA, deploying various apps repeatedly.
It went mostly very well, but there were a few failures.

One thing this highlights is the need to write our entities + blueprints to handle transient failures.


Three areas spring to mind.

_*VM Provisioning Failures*_
Clouds can fail to give us an ssh'able VM.

Setting `machineCreateAttempts` will tell Brooklyn to retry if a VM fails to be created or comes back dead-on-arrival (e.g. can't ssh). This value currently defaults to 1 (i.e. if the first attempt fails, then abort).


Perhaps we should change the default to 2?
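
To make the semantics concrete, here is a rough sketch of the retry behaviour in Python. It is purely illustrative and not Brooklyn's actual code: `provision_with_retries` and `create_machine` are hypothetical names, and "dead-on-arrival" is modelled simply as the callable raising an exception.

```python
def provision_with_retries(create_machine, machine_create_attempts=1):
    """Call create_machine() up to machine_create_attempts times.

    create_machine is a hypothetical callable that returns a usable
    machine, or raises if the VM could not be created or is not ssh'able.
    """
    if machine_create_attempts < 1:
        raise ValueError("machine_create_attempts must be at least 1")

    last_error = None
    for _ in range(machine_create_attempts):
        try:
            return create_machine()
        except Exception as e:  # treat every failure as transient in this sketch
            last_error = e

    raise RuntimeError(
        "provisioning failed after %d attempt(s)" % machine_create_attempts
    ) from last_error
```

With the current default of 1, a single dead VM aborts the deployment; with 2, one bad VM is absorbed at the cost of one extra provisioning call.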


_*Cluster quorum size*_

When starting a cluster (e.g. 16 Cassandra nodes, or whatever), we can get some failures. With the default configuration, any failures result in the cluster reporting itself as failed.

There is a configuration option, `cluster.initial.quorumSize`, which says the minimum number of initial nodes that must come up successfully for the cluster to be considered healthy. E.g. a `cluster.initial.quorumSize` of 12 and a `cluster.initial.size` of 16 mean that we'll accept a maximum of 4 failures on initial deployment.

Should we have a more lenient default (e.g. two thirds of the `cluster.initial.size`)?
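
The arithmetic can be sketched as below, in Python for illustration only. The function names are hypothetical, and rounding the two-thirds figure up (rather than down) is my assumption; nothing above settles that detail.

```python
def max_tolerated_failures(initial_size, quorum_size):
    """Number of initial nodes that may fail while the cluster still counts as healthy."""
    return initial_size - quorum_size


def two_thirds_quorum(initial_size):
    """Two thirds of initial_size, rounded up, using integer arithmetic only."""
    return (2 * initial_size + 2) // 3


def cluster_is_healthy(nodes_started, quorum_size):
    """The cluster is healthy once at least quorum_size nodes came up."""
    return nodes_started >= quorum_size
```

For the 16-node example, a quorum of 12 tolerates 4 failures, while a two-thirds default would give a quorum of 11 and tolerate 5.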



_*Command retries*_
Provisioning commands, e.g. ssh'ing to install software, sometimes fail.

For example, today I saw:
    Execution failed, invalid result -1 for installing CouchbaseNodeImpl
which most likely means there was an ssh connection failure while executing.

In situations like that, we should retry.

We should also retry by default on some other idempotent operations - installing, customizing and stopping are good contenders; launching is harder - it's up to the implementer to explicitly enable retry (but only if it is written to be idempotent; otherwise stop-then-start might be required for retry).
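
As a rough sketch of that policy (Python, illustrative only; `run_phase`, `RETRYABLE_PHASES`, `launch_is_idempotent` and the default of 3 attempts are all hypothetical and not existing Brooklyn names or values): phases assumed idempotent are retried on a non-zero result, and launch is retried only when the implementer opts in.

```python
# Phases assumed safe to repeat, per the suggestion above.
RETRYABLE_PHASES = {"install", "customize", "stop"}


class CommandFailed(Exception):
    pass


def run_phase(phase, command, max_attempts=3, launch_is_idempotent=False):
    """Run command() for a lifecycle phase, retrying where that is safe.

    command is a hypothetical callable returning an exit code; 0 means
    success, anything else (e.g. -1 for a dropped ssh connection) is a failure.
    Returns the number of attempts that were needed.
    """
    retryable = phase in RETRYABLE_PHASES or (
        phase == "launch" and launch_is_idempotent
    )
    attempts = max_attempts if retryable else 1

    result = None
    for attempt in range(1, attempts + 1):
        result = command()
        if result == 0:
            return attempt

    raise CommandFailed(
        "invalid result %s for phase %s after %d attempt(s)" % (result, phase, attempts)
    )
```

A launch that is not idempotent gets a single attempt here; recovering from that failure would need the stop-then-start sequence instead of a blind retry.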


Aled
