Hello,

I'm relatively new to the ZK code base, so please be gentle. :) This is bordering on a question for users@, but I'm asking here because I'm more than happy to dig into the code if it's not too far beyond my reach. I hope that's okay.

I'm trying to dig into / work around ZOOKEEPER-2938: https://issues.apache.org/jira/browse/ZOOKEEPER-2938

Unfortunately, the proposed work-around (simply restarting the leader) isn't particularly great for us because of some limitations in our automation, so I'm trying to see whether we can find an alternative and/or fix the issue properly.

Looking at https://github.com/apache/zookeeper/blob/75411ab34a3d53c43c2d508b12314a9788aa417d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L391 -- afaict what's happening is that the "unhappy"/prospective member of the quorum attempts to connect to other, established members and sends a challenge request (which seems to be a simple payload consisting of its ID and the local election host + port). It then promptly closes the connection because its own ID is less than that of the recipient(s), seemingly without waiting for a response.

The mechanics are all easy enough to understand, but I feel like I'm lacking some context about what's *supposed* to happen here. When this code is working as expected, what *should* happen with respect to these challenges? What is this code trying to achieve by forcefully disconnecting from peers with an ID greater than the local peer's?

I also don't fully understand why restarting the leader would fix things, but that's probably something I need to dive into myself to get to the bottom of this.

I'd appreciate any guidance.

Cheers,
Tom
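
P.S. In case it helps to make the description above concrete, here is a rough sketch of the behaviour as I currently read it. This is not the actual QuorumCnxManager code: the class name, the method names, and the byte layout of the payload are all hypothetical and only illustrate "send ID + election address, then drop the connection if my ID is lower".

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not ZooKeeper's real classes or wire format.
class HandshakeSketch {

    // Assumed payload layout: 8-byte server ID, 4-byte address length, address bytes.
    static byte[] encodeHello(long myId, String electionAddr) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            byte[] addr = electionAddr.getBytes(StandardCharsets.UTF_8);
            out.writeLong(myId);
            out.writeInt(addr.length);
            out.write(addr);
        } catch (IOException e) {
            // Writing to an in-memory buffer should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // The rule as I understand it: after sending the payload, the initiator
    // closes the connection when its own ID is lower than the remote ID.
    static boolean initiatorDropsConnection(long myId, long remoteId) {
        return myId < remoteId;
    }

    public static void main(String[] args) {
        byte[] hello = encodeHello(1L, "host1:3888");
        System.out.println("payload bytes: " + hello.length);
        System.out.println("1 -> 3 dropped by initiator: " + initiatorDropsConnection(1L, 3L));
        System.out.println("3 -> 1 dropped by initiator: " + initiatorDropsConnection(3L, 1L));
    }
}
```

So with IDs 1 and 3, a connection opened by server 1 towards server 3 is closed by server 1 right after the payload is sent, while one opened by server 3 towards server 1 is kept. What I can't tell from the code is what server 3 is expected to do after receiving that payload.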
