I'm observing the following on a cluster that started with 4 nodes. I have been killing and restarting various nodes as I test Cassandra, and now I'm seeing a lot of NotFoundException errors in the client. I believe the cause is ring state that is out of sync between the two nodes that are still up and available. The first ring state shown below reflects the current state of the cluster.

I have also seen similar issues when one node thinks another node is still available when in fact it has been killed. It seems to be related to bringing up and killing nodes too quickly, without giving the remaining nodes time to figure out that a node is "dead". In that case I see a TimedOutException related to the NIO SocketChannel class.
thx!

[cassandra.883477]$ bin/nodeprobe -host gen-app02.dev.real.com -port 8080 ring
Address       Status     Load          Range                                      Ring
                                       144038903974614862325597275257769797985
172.27.128.186Down       22.17 MB      31124469348629903091013930339840898757     |<--|
172.27.128.23 Down       22.17 MB      64378740291415296162944450043143967518     |   |
172.27.128.22 Up         22.17 MB      121134220722269938669001112695509564769    |   |
172.27.128.185Up         14.69 MB      144038903974614862325597275257769797985    |-->|

[cassandra.883477]$ bin/nodeprobe -host vmguest85.prognet.com -port 8080 ring
Address       Status     Load          Range                                      Ring
                                       144038903974614862325597275257769797985
172.27.128.22 Up         22.17 MB      121134220722269938669001112695509564769    |<--|
172.27.128.185Up         14.69 MB      144038903974614862325597275257769797985    |-->|
[cassandra.883477]$
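
To compare the two ring views mechanically rather than by eye, something like the following rough Python sketch might help. It assumes the ring output format pasted above (address, status, load, token on each row) and nothing else. `parse_ring` and `diff_rings` are hypothetical helper names for illustration, not part of nodeprobe.

```python
import re

# One data row of the ring listing: address, status, load, token.
# Address and status can run together (e.g. "172.27.128.186Down"),
# so the whitespace between them is optional.
ROW = re.compile(r"(\d+\.\d+\.\d+\.\d+)\s*(Up|Down)\s+([\d.]+\s+\w+)\s+(\d+)")


def parse_ring(text):
    """Return {address: (status, token)} for every node row found in text."""
    return {m.group(1): (m.group(2), m.group(4)) for m in ROW.finditer(text)}


def diff_rings(view_a, view_b):
    """Compare two parsed views.

    Returns (only_in_a, only_in_b, differing), where 'differing' lists
    addresses present in both views but with a different status or token.
    """
    only_a = sorted(set(view_a) - set(view_b))
    only_b = sorted(set(view_b) - set(view_a))
    differing = sorted(
        addr for addr in set(view_a) & set(view_b) if view_a[addr] != view_b[addr]
    )
    return only_a, only_b, differing
```

Feeding the saved output of each `ring` command through `parse_ring` and passing both results to `diff_rings` lists the nodes that one view knows about and the other does not.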
