I just saw a novel failure mode the other day that might have interesting implications for ZooKeeper.
The problem was that the network was configured for an MTU of 9000, while one machine's MTU was set much smaller. As a result, large packets sent to that machine were dropped, but everything it sent out got through. There were other networking problems in the real case as well, but for a thought experiment this is enough. The horrible implication is that a typical heartbeat or are-you-ok request, which is small, will succeed, while a typical content request, which involves larger packets, will fail. This leads to a situation where hosts appear to be healthy but can't actually do anything. What will ZK do in this case?
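
To make the asymmetry concrete, here is a toy model of the thought experiment. It is only a sketch, not ZooKeeper code: the function names are made up, the 1500-byte MTU for the misconfigured host is an assumed value (all I know is that it was "much smaller"), and headers and fragmentation are ignored.

```python
# Toy model of the MTU mismatch. All names are hypothetical.
JUMBO_MTU = 9000   # MTU used by the rest of the network
SMALL_MTU = 1500   # ASSUMED value for the misconfigured host


def delivered(frame_size: int, receiver_mtu: int) -> bool:
    """A frame larger than the receiver's MTU is dropped on arrival."""
    return frame_size <= receiver_mtu


def send(payload_size: int, sender_mtu: int, receiver_mtu: int) -> bool:
    """Split the payload into frames of at most the sender's MTU.

    The transfer succeeds only if every frame is delivered.
    """
    remaining = payload_size
    while remaining > 0:
        frame = min(remaining, sender_mtu)
        if not delivered(frame, receiver_mtu):
            return False
        remaining -= frame
    return True


# Small heartbeat from a healthy peer to the misconfigured host: gets through.
heartbeat_ok = send(64, JUMBO_MTU, SMALL_MTU)

# Large content transfer to the misconfigured host: jumbo frames are dropped.
content_in_ok = send(100_000, JUMBO_MTU, SMALL_MTU)

# Large transfer from the misconfigured host: its small frames fit everywhere.
content_out_ok = send(100_000, SMALL_MTU, JUMBO_MTU)

print(heartbeat_ok, content_in_ok, content_out_ok)
```

In this model any liveness check smaller than the small MTU passes in both directions, so a health probe would have to carry a payload larger than that to notice the problem.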
