What I have seen so far is mostly related to the initLimit/syncLimit settings in combination with snapshot size (ZOOKEEPER-1697, ZOOKEEPER-1521).
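A quick sanity check for the initLimit/snapshot-size interaction: initLimit is counted in ticks, so a follower has roughly tickTime × initLimit to connect to the leader and finish syncing. That window includes transferring a full snapshot when one is needed. The sketch below compares the two under a deliberately naive model. It assumes transfer time is simply snapshot size divided by sustained throughput and ignores disk I/O, serialization and leader load. The function names and example figures are illustrative only.

```python
def sync_window_seconds(tick_time_ms: int, init_limit_ticks: int) -> float:
    # initLimit is expressed in ticks; one tick lasts tickTime milliseconds.
    return tick_time_ms * init_limit_ticks / 1000.0


def snapshot_transfer_seconds(snapshot_bytes: int, bandwidth_bytes_per_s: float) -> float:
    # Naive model (assumption): size divided by sustained network throughput.
    return snapshot_bytes / bandwidth_bytes_per_s


def snapshot_fits_init_limit(snapshot_bytes: int,
                             bandwidth_bytes_per_s: float,
                             tick_time_ms: int,
                             init_limit_ticks: int) -> bool:
    # True if the estimated snapshot transfer finishes inside the initLimit window.
    transfer = snapshot_transfer_seconds(snapshot_bytes, bandwidth_bytes_per_s)
    return transfer < sync_window_seconds(tick_time_ms, init_limit_ticks)
```

For example, with tickTime=2000 and initLimit=10 the window is 20 s. A 1 GiB snapshot at 50 MB/s needs about 21.5 s, so it would not fit, while a 100 MiB snapshot (about 2.1 s) would fit comfortably.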
It is possible that clients trying to reconnect cause a load spike on the server and push it over the limit, but you would need a lot of clients for that to happen. I think it will be easier to narrow down the problem by checking in which phase (e.g. leader election or synchronization) the quorum fails to form.

--
Thawan Kooburat

On 5/13/13 10:48 AM, "Marshall McMullen" <[email protected]> wrote:

>I'm debugging a problem we're seeing where, after quorum loss, quorum does
>not recover as I expect it should. It seems that I've isolated the problem
>to quorum not being re-established if there are clients trying to connect to
>the ensemble at the same time that the nodes are coming up and trying to
>form quorum. Is there any known issue with this? I've searched for open
>Jiras without any luck.
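One way to see where each node is stuck is to poll every server with the `srvr` four-letter command. A server that has joined a quorum reports `Mode: leader` or `Mode: follower`. A server that is still electing a leader or still syncing replies that it is not currently serving requests. The sketch below assumes the default client port 2181 and the 3.4-style reply format. It also assumes that, on newer releases, `srvr` is allowed by `4lw.commands.whitelist`. The hostnames in the usage note are hypothetical. The `srvr` reply alone cannot tell election apart from synchronization, so the server logs are still needed to pin down the phase.

```python
import socket


def query_srvr(host: str, port: int = 2181, timeout: float = 5.0) -> str:
    """Send the 'srvr' four-letter command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply: str) -> str:
    """Extract the server mode from a 'srvr' reply (assumes 3.4-style output)."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    # Servers that are electing or syncing do not serve requests yet.
    if "not currently serving requests" in reply:
        return "not serving"
    return "unknown"


def survey(hosts, port: int = 2181) -> dict:
    """Map each host to its reported mode, or to an error if it is unreachable."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(query_srvr(host, port))
        except OSError as exc:  # covers refused connections and timeouts
            modes[host] = f"unreachable ({exc})"
    return modes
```

Usage: `survey(["zk1", "zk2", "zk3"])`. Running it in a loop while the ensemble restarts with clients attached shows whether any node ever reaches leader/follower. If the nodes stay at "not serving", the next step is to check their logs for the election or sync phase.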
