I've been tracking an error we see occasionally on our cluster. We're currently running behind trunk, at build 047b07a298d84e9755c6e06c035787ce397f4958.
The error is quite rare, and so far I've had no luck reproducing it in a controlled setting. The symptom is that C clients see errors of the form:

    ZOO_ERROR@handle_socket_error_msg@2726: Socket [10.11.13.2:2181] zk retcode=-2, errno=115(Operation now in progress): unexpected server response: expected 0x529a8be8, but received 0x529a8be6

(Note that the expected/received entries are reversed here; we always receive a larger entry than we were expecting.)

Kazoo clients fail similarly, with the error:

    zookeeper: xids do not match, expected %r received %r', 1435, 1436

Generally we see these failures in groups, where multiple clients see failures from one server within a 5 to 10 second window. Sometimes a single client fails with the error multiple times in that period.

I'd appreciate any insight anyone can give me into why this is happening and how we might fix it. Has anyone seen this before? Any hunches about what code or conditions I might investigate to reliably trigger or fix the error?

I'd greatly appreciate any help in identifying the problem.

-- 
-=-Dutch
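P.S. In case it helps anyone comparing notes, here is a rough sketch of how the C client error line quoted above could be pulled apart to see which server reported the mismatch and how far apart the two values are. The helper name `parse_mismatch` and the regular expression are my own assumptions based only on the single log line quoted above, not anything provided by the ZooKeeper C client or Kazoo, and other builds may format the message differently.

```python
import re

# Hypothetical helper: the pattern is inferred from the one log line quoted
# in this message and is not an official ZooKeeper log format definition.
_MISMATCH = re.compile(
    r"Socket \[(?P<server>[^\]]+)\].*?"
    r"expected (?P<expected>0x[0-9a-fA-F]+), "
    r"but received (?P<received>0x[0-9a-fA-F]+)"
)


def parse_mismatch(line):
    """Return (server, printed_expected, printed_received, gap) or None.

    The values are returned exactly as labeled in the log line. As noted
    above, those labels appear to be swapped, so a positive gap means the
    value actually received was ahead of the one the client was waiting for.
    """
    m = _MISMATCH.search(line)
    if m is None:
        return None
    expected = int(m.group("expected"), 16)
    received = int(m.group("received"), 16)
    return m.group("server"), expected, received, expected - received


SAMPLE = (
    "ZOO_ERROR@handle_socket_error_msg@2726: Socket [10.11.13.2:2181] "
    "zk retcode=-2, errno=115(Operation now in progress): unexpected server "
    "response: expected 0x529a8be8, but received 0x529a8be6"
)

if __name__ == "__main__":
    print(parse_mismatch(SAMPLE))
```

Running this over the client logs and grouping the results by server should make it easier to check whether the failures really cluster on one server within the same few seconds.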
