Hi all,

I've spent today doing a lot of QA, deploying various apps repeatedly.
It went mostly very well, but there were a few failures.

The failures highlight the need to write our entities and blueprints so that they handle transient failures.

Three areas spring to mind.

_*VM Provisioning Failures*_
Clouds can fail to give us an ssh'able VM.

Setting `machineCreateAttempts` tells Brooklyn to retry if a VM fails to be created or comes back dead-on-arrival (e.g. we can't ssh to it). This value currently defaults to 1 (i.e. if the first attempt fails, we abort).

Perhaps we should change the default to 2?
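
To make the semantics concrete, here is a minimal sketch in Python (not Brooklyn's actual code) of what an attempts count like this implies. The names `obtain_machine`, `create_vm` and `is_sshable` are hypothetical, made up for illustration; only `machineCreateAttempts` comes from Brooklyn.

```python
def obtain_machine(create_vm, is_sshable, machine_create_attempts=1):
    """Return a usable VM, making at most machine_create_attempts attempts.

    create_vm and is_sshable are hypothetical callables standing in for the
    cloud API call and the ssh reachability check.
    """
    failures = []
    for attempt in range(1, machine_create_attempts + 1):
        try:
            vm = create_vm()
        except Exception as exc:
            # The cloud refused or failed to create the VM.
            failures.append(f"attempt {attempt}: creation failed: {exc}")
            continue
        if is_sshable(vm):
            return vm
        # Dead on arrival; a real implementation would also release this VM.
        failures.append(f"attempt {attempt}: {vm!r} is not ssh'able")
    raise RuntimeError("; ".join(failures))
```

With `machine_create_attempts=1` a single cloud hiccup aborts the deployment, whereas with 2 the same hiccup is absorbed at the cost of one extra provisioning call.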


_*Cluster quorum size*_
When starting a cluster (e.g. 16 Cassandra nodes, or whatever), we can get some failures. With the default configuration, any failures result in the cluster reporting itself as failed.

There is a configuration option, `cluster.initial.quorumSize`, which sets the minimum number of initial nodes that must come up successfully for the cluster to be considered healthy. For example, a `cluster.initial.quorumSize` of 12 with a `cluster.initial.size` of 16 means that we'll accept a maximum of 4 failures on initial deployment.

Should we have a more lenient default (e.g. two thirds of the cluster.initial.size)?
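
As a rough illustration of the arithmetic, the sketch below (Python, purely for discussion) computes a two-thirds default and the number of failures it would tolerate. The function names are hypothetical, and rounding the quorum up is my assumption; we would need to agree on the rounding rule.

```python
def default_quorum(initial_size, numerator=2, denominator=3):
    """Proposed default quorum: two thirds of the initial size, rounded up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (initial_size * numerator + denominator - 1) // denominator


def max_tolerated_failures(initial_size, quorum_size):
    """How many initial nodes may fail while the cluster still counts as healthy."""
    return initial_size - quorum_size


def cluster_is_healthy(nodes_started, quorum_size):
    """The cluster is healthy once at least quorum_size nodes came up."""
    return nodes_started >= quorum_size
```

For a cluster of 16, `default_quorum(16)` gives 11, so up to 5 nodes could fail on initial deployment, compared with 4 for the explicit quorum of 12 in the example above.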


_*Command retries*_
Provisioning commands, e.g. ssh'ing to install software, sometimes fail.

For example, today I saw:
    Execution failed, invalid result -1 for installing CouchbaseNodeImpl
which most likely means there was an ssh connection failure while executing.

In situations like that, we should retry.

We should also retry by default on other idempotent operations: installing, customizing and stopping are good contenders. Launching is harder. There, it should be up to the implementer to explicitly enable retry, and only if the launch is written to be idempotent; otherwise a retry might need to stop and then start again.
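
A minimal sketch of that policy, in Python for brevity rather than as a proposed implementation: `run_phase`, `RETRY_BY_DEFAULT` and `launch_is_idempotent` are hypothetical names, and retrying on any non-zero result (rather than only on -1) is an assumption that we may want to tighten.

```python
import time

# Phases proposed above as safe to retry by default; launch is opt-in only.
RETRY_BY_DEFAULT = {"install", "customize", "stop"}


def run_phase(phase, command, max_attempts=3, launch_is_idempotent=False,
              delay_seconds=0.0):
    """Run command (a callable returning an exit code), retrying where safe."""
    retryable = phase in RETRY_BY_DEFAULT or (
        phase == "launch" and launch_is_idempotent)
    attempts = max_attempts if retryable else 1
    exit_code = None
    for attempt in range(attempts):
        exit_code = command()
        if exit_code == 0:
            return exit_code
        if attempt + 1 < attempts:
            # Brief pause before retrying, in case the failure is transient.
            time.sleep(delay_seconds)
    raise RuntimeError(
        f"Execution failed, invalid result {exit_code} for {phase}")
```

Here a non-idempotent launch gets exactly one attempt, so its implementer has to opt in (or fall back to stop-then-start) before any retry happens.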

Aled
