Getting temporary errors when a node goes down, until the other nodes' failure detectors realize it is down, is normal. (This should only take a dozen seconds or so.)
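During that window, a client can simply fall back to another node instead of surfacing the timeout. Below is a minimal failover sketch; the RangeQuery interface, the node names, and the simulated failure are hypothetical stand-ins, since the exact get_key_range signature depends on which Cassandra/Thrift bindings you generated.

// Client-side failover sketch (not part of Cassandra itself): try each node in
// turn and move to the next one when a call fails or times out.
import org.apache.thrift.TException;
import java.util.Arrays;
import java.util.List;

public class RangeQueryFailover {

    /** Hypothetical wrapper around one get_key_range call against one node. */
    interface RangeQuery {
        List<String> run(String host) throws TException;
    }

    /** Try each node in order; return the first successful result, or rethrow the last error. */
    static List<String> withFailover(List<String> nodes, RangeQuery query) throws TException {
        TException last = null;
        for (String host : nodes) {
            try {
                return query.run(host);
            } catch (TException e) {
                last = e;  // e.g. "Internal error processing get_key_range" or a transport timeout
            }
        }
        throw last != null ? last : new TException("no nodes to query");
    }

    public static void main(String[] args) throws TException {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4", "node5");

        // Dummy query that simulates node5 being down; real code would open a
        // TSocket, build a Cassandra.Client, and call get_key_range here.
        RangeQuery demo = new RangeQuery() {
            public List<String> run(String host) throws TException {
                if (host.equals("node5")) {
                    throw new TException("simulated: " + host + " is down");
                }
                return Arrays.asList("key1", "key2");
            }
        };

        System.out.println(withFailover(nodes, demo));  // prints [key1, key2]
    }
}

This only papers over the detection window described above; if a node keeps failing get_key_range after the failure detector has caught up, failover alone will not explain it, which is the separate issue discussed next.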
But after that it should route requests to other nodes, and it should also realize when you restart #5 that it is alive again. Those are two separate issues. Can you verify whether "bin/nodeprobe cluster" shows that node 1 eventually does or does not see #5 as dead, and then as alive again?

-Jonathan

On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith <[email protected]> wrote:
> I'm seeing an issue similar to:
>
> http://issues.apache.org/jira/browse/CASSANDRA-169
>
> Here is when I see it. I'm running Cassandra on 5 nodes using the
> OrderPreservingPartitioner, and have populated Cassandra with 78
> records, and I can use get_key_range via Thrift just fine. Then, if I
> manually kill one of the nodes (node #5), the node (node #1) which
> I've been using to call get_key_range will time out and return the
> error:
>
> Thrift: Internal error processing get_key_range
>
> And the Cassandra output shows the same trace as in 169:
>
> ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed]
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>         at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>         at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>         at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
> ERROR - Internal error processing get_key_range
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out.
>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>         at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>         at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>         at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:675)
> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>         at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>         ... 7 more
>
> If it was giving an error just one time, I could just rely on catching
> the error and trying again. But a get_key_range call to the node I
> was already making get_key_range queries against (node #1) never works
> again (it is still up and it responds fine to multiget Thrift calls),
> sometimes not even after I restart the down node (node #5). I end up
> having to restart node #1 in addition to node #5. The behavior of
> the other 3 nodes varies: some of them are also unable to respond to
> get_key_range calls, but some of them do respond.
>
> My question is: what path should I go down in terms of reproducing
> this problem? I'm using Aug 27 trunk code - should I update my
> Cassandra install prior to gathering more information for this issue,
> and if so, which version (0.4 or trunk)? If there is anyone who is
> familiar with this issue, could you let me know what I might be doing
> wrong, or what the next info-gathering step should be for me?
>
> Thank you,
>
> Simon Smith
> Arcode Corporation
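Two quick ways to gather the data Jonathan asks for: run his "bin/nodeprobe cluster" suggestion against node 1 before killing #5, after killing it, and again after restarting it, and compare what node 1's failure detector reports each time; and, from the client side, check which nodes are actually accepting Thrift connections. The sketch below does only the latter. It probes each node's Thrift port with a short timeout, so it tells you what your client can reach, not what node 1's gossip/failure detector believes. The host names and the 9160 port are assumptions from the default configuration; use whatever your storage-conf.xml actually binds.

// Rough client-side reachability probe: open a Thrift connection to each
// node's client port with a short timeout and report which ones answer.
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransportException;

public class ProbeNodes {
    public static void main(String[] args) {
        String[] nodes = {"node1", "node2", "node3", "node4", "node5"};
        int thriftPort = 9160;  // default Thrift listen port in storage-conf.xml

        for (String host : nodes) {
            TSocket socket = new TSocket(host, thriftPort, 2000);  // 2-second timeout
            try {
                socket.open();
                System.out.println(host + ": reachable");
            } catch (TTransportException e) {
                System.out.println(host + ": NOT reachable (" + e.getMessage() + ")");
            } finally {
                if (socket.isOpen()) {
                    socket.close();
                }
            }
        }
    }
}

Running this while #5 is down and again after it restarts, and comparing the output with nodeprobe's view from node 1, should show whether node 1 never marks #5 alive again or whether the get_key_range path on node 1 stays wedged even though gossip has recovered.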
