Hey,

>> Generally the majority can be wrong: Assume you have a
>> network failure in a three-node MMR configuration: You update one
>> node while the other two are unreachable. The communication resumes;
>> do you expect the change on the one node to be reverted to the
>> majority, or should the majority be updated from the one node that
>> has more recent data?
>
> Indeed. In syncrepl, "voting" is irrelevant. Changes will be accepted
> by any provider node that a client can reach. When connectivity is
> restored all nodes will bring each other up to date. In majority-based
> voting, you will lose any writes to the minority node, which leaves
> you with unresolvable inconsistencies. I.e., data is removed but the
> clients believe it was written.
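To make that scenario concrete, here is a toy sketch (plain Python,
hypothetical names, nothing to do with slapd's actual internals) of the
two reconciliation policies after the partition heals: syncrepl-style
"newest write wins" versus reverting the minority node to the majority.

    # Toy model of a three-node MMR cluster healing after a partition.
    # Hypothetical sketch only; real syncrepl compares entryCSNs, not
    # these ad-hoc stamps.

    def newest_wins(replicas):
        """Syncrepl-style: every node adopts the most recent write it
        can see, regardless of how many nodes currently hold it."""
        return max(replicas, key=lambda r: r["stamp"])

    def majority_wins(replicas):
        """Voting-style: the value held by most nodes wins, silently
        discarding the minority node's newer write."""
        values = [r["value"] for r in replicas]
        return max(replicas, key=lambda r: values.count(r["value"]))

    # One node took a write at t=2 while the other two were unreachable.
    cluster = [
        {"node": "A", "value": "new", "stamp": 2},  # updated during the outage
        {"node": "B", "value": "old", "stamp": 1},  # was partitioned
        {"node": "C", "value": "old", "stamp": 1},  # was partitioned
    ]

    print(newest_wins(cluster)["value"])    # 'new' -- the write survives
    print(majority_wins(cluster)["value"])  # 'old' -- the write is lost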
This turns out to be a matter of choice -- I would not go for majority
voting without getting confirmation from a majority about the success
of a transaction, which is what BerkeleyDB does. What I'm hearing here
is that this "formal" approach adds delay and gains little in
practice -- only the *certainty* that the data has been stored with
the quality level assured by replication. That certainty costs a write
delay, and it only matters when lightning strikes just after a write
to one master. Interestingly, OpenStack Swift takes the same
approach -- commit a write based on local storage, then replicate
later.

> back-hdb and back-bdb both use BerkeleyDB. BerkeleyDB is now
> deprecated/obsolete, and LMDB is the default backend.

I'm preparing new installations, so I suppose I will get to see it as
the default.

> BDB's replication is page-oriented, so it would consume far more
> network resources than syncrepl. We have never recommended its use.

That was indeed a design consideration I was weighing. I think the
trade-off recommended here is clear, and it makes sense. I don't flush
to disk after every write either, after all.

Thanks,
 -Rick
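P.S. For the archives, a minimal sketch of the trade-off I mean
(hypothetical names; this is not BerkeleyDB's or OpenLDAP's actual
API): a quorum commit blocks the writer until a majority acknowledges,
while a local commit returns immediately and replicates in the
background.

    class Replica:
        """Toy replica: 'up' nodes accept writes, 'down' nodes do not."""
        def __init__(self, name, up=True):
            self.name, self.up, self.log = name, up, []

        def apply(self, write):
            if self.up:
                self.log.append(write)
            return self.up

    def quorum_commit(write, replicas):
        """BDB-style: block until a majority acknowledges, so success
        means the write survives the loss of any minority of nodes."""
        acks = sum(1 for r in replicas if r.apply(write))
        if acks <= len(replicas) // 2:
            raise IOError("no majority reachable; write rejected")

    def local_commit(write, local, others):
        """Syncrepl/Swift-style: commit locally and return at once; the
        other replicas are brought up to date off the write path."""
        local.apply(write)
        # 'others' would be updated asynchronously by the replication
        # engine, so the client never waits on the network.

    a, b, c = Replica("A"), Replica("B", up=False), Replica("C", up=False)

    local_commit("x=1", a, [b, c])       # returns at once, durable on A only
    try:
        quorum_commit("x=1", [a, b, c])  # only 1 of 3 nodes acked
    except IOError as e:
        print(e)                         # the write is refused, not lost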
