We have a cluster of servers. At the moment there are three servers, each 
having two separate instances of CouchDB, like this:

        node0-couch1
        node0-couch2

        node1-couch1
        node1-couch2

        node2-couch1
        node2-couch2

All couch1 instances are set up to replicate continuously using bidirectional 
pull replication, roughly as sketched after the list below. That is:

        node0-couch1    pulls from node1-couch1 and node2-couch1
        node1-couch1    pulls from node0-couch1 and node2-couch1
        node2-couch1    pulls from node0-couch1 and node1-couch1
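
For clarity, each of these pulls is simply a continuous replication triggered 
against the pulling instance's /_replicate endpoint (continuous replication 
has been available since 0.10). A minimal sketch follows; the instance URLs 
and the database name "logs" are placeholders, not our real names:

        # Start one continuous pull replication against CouchDB 0.10.x:
        # POST to the pulling instance's /_replicate with "continuous": true.
        import json
        import urllib.request

        def start_pull(puller, source, db="logs"):
            """Make `puller` continuously pull `db` from `source`."""
            body = json.dumps({
                "source": "%s/%s" % (source, db),  # remote db to pull from
                "target": db,                      # local db on the puller
                "continuous": True,                # keep replicating
            }).encode("utf-8")
            req = urllib.request.Request(
                "%s/_replicate" % puller,
                data=body,
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)

        # e.g. node0-couch1 pulling from node1-couch1:
        # start_pull("http://node0:5984", "http://node1:5984")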

On each node, couch1 and couch2 are also set up to pull from each other 
continuously. Thus, the full replication topology is as follows (wired up 
roughly as in the sketch after the list):

        node0-couch1    pulls from node1-couch1, node2-couch1, and node0-couch2
        node0-couch2    pulls from node0-couch1

        node1-couch1    pulls from node0-couch1, node2-couch1, and node1-couch2
        node1-couch2    pulls from node1-couch1

        node2-couch1    pulls from node0-couch1, node1-couch1, and node2-couch2
        node2-couch2    pulls from node2-couch1
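
Schematically, the whole thing is wired up like the sketch below, reusing the 
hypothetical start_pull() helper from the previous snippet. The ports (5984 
for couch1, 5985 for couch2) are assumptions for the sake of the example, not 
our actual configuration:

        # Wire up the full topology above with the (hypothetical)
        # start_pull() helper; ports are assumptions, not our real config.
        NODES = ["node0", "node1", "node2"]

        def couch1(node):
            return "http://%s:5984" % node   # first instance on each node

        def couch2(node):
            return "http://%s:5985" % node   # second instance on each node

        def wire_up_cluster():
            for node in NODES:
                # every couch1 pulls from couch1 on the two other nodes...
                for other in NODES:
                    if other != node:
                        start_pull(couch1(node), couch1(other))
                # ...and couch1/couch2 on the same node pull from each other
                start_pull(couch1(node), couch2(node))
                start_pull(couch2(node), couch1(node))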

No proxies are involved. In our staging system, all servers are on the same 
subnet.

The problem is that every night the entire cluster dies: all six CouchDB 
instances crash, and they do so at exactly the same time.

The data being replicated is very minimal at the moment - simple log text 
lines, no attachments. The entire database being replicated is no more than a 
few megabytes in size.

The syslogs give no clue. The CouchDB logs are difficult to interpret unless 
you are an Erlang programmer. If anyone would care to look at them, just let me 
know.
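
To give a sense of what we would extract for anyone willing to look, here is 
a rough sketch that just pulls the [error] entries and the Erlang crash and 
supervisor report lines out of couch.log; the path is the Debian default, and 
the choice of markers is only an assumption about what is worth reading:

        # Rough sketch: skim couch.log for error and crash-report lines.
        # Path and markers are assumptions, not a definitive recipe.
        import re

        LOG = "/var/log/couchdb/couch.log"
        MARKERS = re.compile(r"\[error\]|CRASH REPORT|SUPERVISOR REPORT")

        def interesting_lines(path=LOG):
            with open(path, errors="replace") as fh:
                for line in fh:
                    if MARKERS.search(line):
                        yield line.rstrip()

        if __name__ == "__main__":
            for line in interesting_lines():
                print(line)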

Any clues as to why this is happening? We're using 0.10.1 on Debian.

We are planning to build quite sophisticated cross-cluster job queue 
functionality on top of CouchDB, but a situation like this suggests that 
CouchDB replication is currently too unreliable to be of practical use, unless 
this is a known and/or already fixed bug.

Any pointers or ideas are most welcome.

        / Peter Bengtson

