I've been experimenting with propagating bind() failures on server ports in tests up to a level where we can do something about them. There's at least one category of test case (callers of ReconfigTest.testPortChangeToBlockedPort) where the server is supposed to ride through a bind() failure, recovering on a subsequent reconfiguration. In my current code state, I'm encountering errors like this:
2018-11-24 11:04:46,252 [myid:] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@878] - Got user-level KeeperException when processing sessionid:0x1002b98aa830000 type:reconfig cxid:0x1e zxid:0x10000002b txntype:-1 reqpath:n/a Error Path:null Error:KeeperErrorCode = ReconfigInProgress

I can hack things until this particular test passes, but it raises broader questions about reconfiguration. How exactly is the cluster supposed to get out of this state? If a cluster member drops out of contact with the quorum while a reconfiguration is in flight, is there any recovery path that restores the ability to process a reconfig operation? Is there a design doc for reconfiguration that demonstrates the kind of robustness against Byzantine faults that one is led to expect from ZooKeeper?
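For context, the underlying failure mode these tests induce is just the OS refusing a second bind() on an already-occupied port. A minimal, self-contained Java sketch of that behavior (the class name BindFailureDemo is mine, not something from the test suite):

```java
import java.net.BindException;
import java.net.ServerSocket;

public class BindFailureDemo {
    public static void main(String[] args) throws Exception {
        // Occupy an ephemeral port, then try to bind a second socket to it,
        // simulating what happens when a reconfig moves a server onto a
        // port that something else already holds.
        try (ServerSocket first = new ServerSocket(0)) {
            int port = first.getLocalPort();
            try (ServerSocket second = new ServerSocket(port)) {
                System.out.println("unexpected: second bind succeeded on " + port);
            } catch (BindException e) {
                // This is the error the server has to ride through and
                // recover from on a subsequent reconfiguration.
                System.out.println("bind failed as expected: " + e.getMessage());
            }
        }
    }
}
```

The question is what the server is supposed to do between this exception and the next successful reconfig.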
