Dear DRBD9 & drbdmanage users & testers,
We have published version 0.99.8 of drbdmanage.
If you use drbdmanage, you certainly want to upgrade.
- Bug fix for a too-short TCP communication timeout: For a long time now,
  one leader node has communicated with all other nodes in the cluster via
  TCP. A wrong timeout gave satellites only 2 seconds to finish their work
  (e.g., creating resource files, creating metadata, bringing the
  resource up). This was a bug, and in busy clusters users saw "pending
  actions".
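  As a generic illustration of the underlying mechanism (this is not
  drbdmanage's code; the function name and values are made up), a
  per-connection socket timeout bounds how long the leader waits for a
  satellite's reply, and it must be generous enough to cover slow work
  such as creating metadata:

  ```python
  import socket

  # Illustrative sketch only: a 2-second timeout is far too short for
  # operations like creating metadata or bringing a resource up, so a
  # generous per-connection timeout avoids spurious failures on busy nodes.
  def query_satellite(host, port, request, timeout=60.0):
      with socket.create_connection((host, port), timeout=timeout) as sock:
          sock.sendall(request)
          return sock.recv(4096)
  ```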
- Change quorum tracking: Quorum tracking is used, for example, for
  leader election. Only if there is a majority of nodes does one of them
  become the leader. This was based on connect events on the control
  volume. If a node missed that event (e.g., because the volume was
  already up), it considered other nodes offline, even though everything
  was fine. Users saw this frequently when executing "drbdmanage nodes".
  Depending on when a node started, the output was even inconsistent
  within the cluster, as local information is used to print that status.
  Now all (DRBD-)connected nodes are considered. That said, the output of
  "drbdmanage role" is the important one.
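  The majority rule above can be sketched in a few lines (a minimal,
  hypothetical illustration, not drbdmanage's actual implementation):

  ```python
  # Majority-based quorum sketch: a node may act as leader only if it,
  # together with the peers it currently sees as connected, forms a
  # strict majority of the whole cluster.
  def have_quorum(connected_peers, cluster_size):
      """Return True if this node plus its connected peers are a majority."""
      present = 1 + len(connected_peers)   # count ourselves, too
      return present > cluster_size // 2

  # Example: in a 5-node cluster, seeing 2 peers (3 nodes total) suffices.
  ```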
- So far, drbdmanage relied only on its own view of the world. For
  example, if it thought a deployment was pending, it retried ad
  infinitum and failed, even though the deployment had in fact already
  completed. For deployments it now also considers the real world: it
  checks whether the DRBD resource already exists and is healthy, and in
  that case considers the deployment successful.
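  The general "check reality before retrying" pattern looks roughly like
  this (hypothetical helper names; a sketch of the idea, not drbdmanage's
  API):

  ```python
  # Idempotent deployment sketch: before retrying a "pending" deployment,
  # consult the actual system state; an already-healthy resource counts
  # as success and the internal view is reconciled with reality.
  def deploy(resource, state_db, resource_is_healthy, do_deploy):
      if state_db.get(resource) == "pending":
          if resource_is_healthy(resource):
              state_db[resource] = "deployed"   # reconcile view with reality
              return "already-deployed"
      do_deploy(resource)
      state_db[resource] = "deployed"
      return "deployed"
  ```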
- Fix locking between the leader and its satellites: By switching to TCP
  communication and a threaded TCP server, we obviously introduced
  concurrency. Unfortunately, the locking between local components, as
  well as between cluster nodes in general, was incomplete. A read on a
  satellite at the wrong point in time overwrote the local control
  volume (the cluster DB), which under certain conditions was then sent
  back to the leader as the new cluster DB. With this fix it no longer
  matters whether commands are executed on the leader or on a satellite.
  This was tested with while-true loops on satellites and the leader
  while another satellite created resources. Many of them ;).
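  The general pattern behind such a fix is to serialize all access to the
  shared state, for both readers and writers (a minimal sketch of that
  pattern, assuming nothing about drbdmanage's internals):

  ```python
  import threading

  # In a threaded server, a concurrent read can otherwise observe (and
  # propagate) a half-written copy of the shared cluster DB; one lock
  # held by both readers and writers prevents that.
  class ClusterDB:
      def __init__(self):
          self._lock = threading.Lock()
          self._data = {}

      def update(self, key, value):
          with self._lock:            # writers hold the lock ...
              self._data[key] = value

      def snapshot(self):
          with self._lock:            # ... and so do readers
              return dict(self._data)
  ```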
- Read actions (and therefore local updates of the cluster DB)
  potentially triggered actions on satellites. For example, a satellite
  created resources, then later got the order from the leader to create
  that same, now existing resource, and failed because it already
  existed. Now satellite nodes do not execute any actions without
  getting an order from the leader.
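  The behavior described above can be sketched as order-gated execution
  (class and method names are hypothetical, for illustration only): a DB
  update merely records intent, and only an explicit leader order acts on
  it.

  ```python
  # Sketch: a satellite separates "what the cluster DB says should exist"
  # from "what I actually do"; only leader orders cause actions, so the
  # same resource is never created twice from two different triggers.
  class Satellite:
      def __init__(self):
          self.desired = set()    # intent, learned from DB updates
          self.deployed = set()   # actions actually performed

      def on_db_update(self, resources):
          # Reading/updating the DB records intent but triggers no action.
          self.desired = set(resources)

      def on_leader_order(self, resource):
          # Actions happen solely on explicit orders, and are idempotent.
          if resource not in self.deployed:
              self.deployed.add(resource)
  ```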
More details can be found in the corresponding git logs.
As usual, we provide tarballs, a git repo, and an Ubuntu PPA.
If you have any questions, suggestions or feedback for us, feel free to
post to the drbd-user mailing list.
Best regards, rck