We have a cluster of servers. At the moment there are three, each running two
separate CouchDB instances, like this:
node0-couch1
node0-couch2
node1-couch1
node1-couch2
node2-couch1
node2-couch2
All couch1 instances are set up to replicate continuously using bidirectional
pull replication. That is (a sketch of how each pull is started follows the
list):
node0-couch1 pulls from node1-couch1 and node2-couch1
node1-couch1 pulls from node0-couch1 and node2-couch1
node2-couch1 pulls from node0-couch1 and node1-couch1
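For reference, each of these is started with a plain continuous pull request
against the local instance's _replicate endpoint. A minimal sketch, assuming
each instance is reachable under its own hostname on the default port 5984 and
that the database is called "logs" (both are placeholders, not our real names):

    # Minimal sketch: start one continuous pull replication by POSTing
    # to the local instance's _replicate endpoint. Hostnames, the
    # default port 5984, and the database name "logs" are placeholders.
    import json
    import urllib.request

    def start_pull(local, remote, db="logs"):
        """Ask `local` to continuously pull `db` from `remote`."""
        body = json.dumps({
            "source": "http://%s:5984/%s" % (remote, db),
            "target": db,
            "continuous": True,
        }).encode()
        req = urllib.request.Request(
            "http://%s:5984/_replicate" % local,
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    # e.g. node0-couch1 pulling from its two peers:
    start_pull("node0-couch1", "node1-couch1")
    start_pull("node0-couch1", "node2-couch1")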
On each node, couch1 and couch2 are likewise set up to replicate with each
other continuously, again using pull replication. The full replication
topology is thus (spelled out as code after the list):
node0-couch1 pulls from node1-couch1, node2-couch1, and node0-couch2
node0-couch2 pulls from node0-couch1
node1-couch1 pulls from node0-couch1, node2-couch1, and node1-couch2
node1-couch2 pulls from node1-couch1
node2-couch1 pulls from node0-couch1, node1-couch1, and node2-couch2
node2-couch2 pulls from node2-couch1
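Written out as data, one (pulling instance, source instance) pair per
continuous pull, the whole topology is just this, using the start_pull()
helper from the sketch above:

    # The complete topology: one continuous pull replication per pair,
    # started with the start_pull() helper from the previous sketch.
    PULLS = [
        ("node0-couch1", "node1-couch1"),
        ("node0-couch1", "node2-couch1"),
        ("node0-couch1", "node0-couch2"),
        ("node0-couch2", "node0-couch1"),
        ("node1-couch1", "node0-couch1"),
        ("node1-couch1", "node2-couch1"),
        ("node1-couch1", "node1-couch2"),
        ("node1-couch2", "node1-couch1"),
        ("node2-couch1", "node0-couch1"),
        ("node2-couch1", "node1-couch1"),
        ("node2-couch1", "node2-couch2"),
        ("node2-couch2", "node2-couch1"),
    ]

    for puller, source in PULLS:
        start_pull(puller, source)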
No proxies are involved. In our staging system, all servers are on the same
subnet.
The problem is that every night the entire cluster dies: all six CouchDB
instances crash, and they crash exactly simultaneously.
The data being replicated is minimal at the moment: simple text log lines, no
attachments. The entire replicated database is no more than a few megabytes in
size.
The syslogs give no clue. The CouchDB logs are difficult to interpret unless
you are an Erlang programmer. If anyone would care to look at them, just let me
know.
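For what it's worth, the Erlang crash reports can at least be pulled out of
the log mechanically. A rough sketch; the log path /var/log/couchdb/couch.log
and the SASL "CRASH REPORT" marker are assumptions based on Debian defaults,
not something I've confirmed against every setup:

    # Rough sketch: extract Erlang SASL crash reports from a CouchDB
    # log. The default log path and the "CRASH REPORT" header are
    # assumptions about the Debian packaging.
    import sys

    def crash_reports(path="/var/log/couchdb/couch.log"):
        """Yield blocks of log lines, each starting at a CRASH REPORT
        header and ending at the next header or a blank line."""
        block = []
        with open(path, errors="replace") as f:
            for line in f:
                if "CRASH REPORT" in line:
                    if block:
                        yield "".join(block)
                    block = [line]
                elif block:
                    if line.strip() == "":
                        yield "".join(block)
                        block = []
                    else:
                        block.append(line)
        if block:
            yield "".join(block)

    if __name__ == "__main__":
        for report in crash_reports(*sys.argv[1:]):
            print(report)
            print("-" * 60)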
Any clues as to why this is happening? We're running CouchDB 0.10.1 on Debian.
We are planning to build fairly sophisticated cross-cluster job queue
functionality on top of CouchDB, but a situation like this suggests that
CouchDB replication is currently too unreliable to be of practical use, unless
this is a known bug, and ideally one that has already been fixed.
Any pointers or ideas are most welcome.
/ Peter Bengtson