I think it might take quite a bit of effort for me to figure out how to use a Java debugger - it will be a lot quicker if you can give me a patch; then I can certainly re-build using ant against either the latest trunk or the latest 0.4 and re-run my test.
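In the meantime, my (possibly wrong) mental model of what findSuitableEndPoint should be doing, based on your description quoted below, is just "skip the replicas the failure detector considers dead." Roughly this sketch - not the actual Cassandra code, just the behaviour I'm expecting:

// Not Cassandra's actual code - just the behaviour I'm expecting from
// findSuitableEndPoint: given the replica endpoints for a range, skip the
// ones the failure detector considers dead and return the first live one.
import java.util.Arrays;
import java.util.List;

public class EndpointPickerSketch {
    /** Stand-in for the failure detector. */
    interface Liveness {
        boolean isAlive(String endpoint);
    }

    static String findSuitableEndpoint(List<String> replicas, Liveness fd) {
        for (String ep : replicas) {
            if (fd.isAlive(ep)) {
                return ep;          // first live replica wins
            }
        }
        return null;                // no live replica for this range
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("node5", "node1", "node2");
        Liveness fd = new Liveness() {
            public boolean isAlive(String ep) {
                return !ep.equals("node5");   // pretend node #5 is down
            }
        };
        // Expected: node1 is chosen, not the dead node5
        System.out.println(findSuitableEndpoint(replicas, fd));
    }
}
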
Thanks,

Simon

On Wed, Sep 9, 2009 at 6:52 PM, Jonathan Ellis <[email protected]> wrote:
> Okay, so when #5 comes back up, #1 eventually stops erroring out and
> you don't have to restart #1? That is good, that would have been a
> bigger problem. :)
>
> If you are comfortable using a Java debugger (by default Cassandra
> listens for one on 8888) you can look at what is going on inside
> StorageProxy.getKeyRange on node #1 at the call to
>
>     EndPoint endPoint =
>         StorageService.instance().findSuitableEndPoint(command.startWith);
>
> findSuitableEndpoint is supposed to pick a live node, not a dead one. :)
>
> If not, I can write a patch to log extra information for this bug so we
> can track it down.
>
> -Jonathan
>
> On Wed, Sep 9, 2009 at 5:43 PM, Simon Smith <[email protected]> wrote:
>> The error starts as soon as the downed node #5 goes down and lasts
>> until I restart the downed node #5.
>>
>> bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
>> and when it is up again).
>>
>> Since I set the replication factor to 3, I'm confused as to why (after
>> the first few seconds or so) there is an error just because one host
>> is down temporarily.
>>
>> The way I have the test set up is that I have a script running on each
>> of the nodes that calls get_key_range over and over against
>> "localhost". Depending on which node I take down, the behavior
>> varies: if I take down one host, it is the only one giving errors (the
>> other 4 nodes still work). For the other 4 situations, either 2 or 3
>> nodes continue to work (i.e. the downed node and either one or two
>> other nodes are the ones giving errors). Note: the nodes that keep
>> working never fail at all, not even for a few seconds.
>>
>> I am running this on 4GB "cloud server" boxes in Rackspace; I can set
>> up just about any test needed to help debug this and capture output or
>> logs, and can give a Cassandra developer access if it would help. Of
>> course I can include whatever config files or log files would be
>> helpful, I just don't want to spam the list unless it is relevant.
>>
>> Thanks again,
>>
>> Simon
>>
>>
>> On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis <[email protected]> wrote:
>>> getting temporary errors when a node goes down, until the other nodes'
>>> failure detectors realize it's down, is normal. (this should only
>>> take a dozen seconds, or so.)
>>>
>>> but after that it should route requests to other nodes, and it should
>>> also realize when you restart #5 that it is alive again. those are
>>> two separate issues.
>>>
>>> can you verify that "bin/nodeprobe cluster" shows that node 1
>>> eventually does/does not see #5 dead, and alive again?
>>>
>>> -Jonathan
>>>
>>> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith <[email protected]> wrote:
>>>> I'm seeing an issue similar to:
>>>>
>>>> http://issues.apache.org/jira/browse/CASSANDRA-169
>>>>
>>>> Here is when I see it. I'm running Cassandra on 5 nodes using the
>>>> OrderPreservingPartitioner, and have populated Cassandra with 78
>>>> records, and I can use get_key_range via Thrift just fine. Then, if I
>>>> manually kill one of the nodes (if I kill off node #5), the node (node
>>>> #1) which I've been using to call get_key_range will time out with the
>>>> error:
>>>>
>>>> Thrift: Internal error processing get_key_range
>>>>
>>>> And the Cassandra output shows the same trace as in 169:
>>>>
>>>> ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed]
>>>> java.net.ConnectException: Connection refused
>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>>>>     at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>>>>     at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>>>>     at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
>>>> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
>>>> ERROR - Internal error processing get_key_range
>>>> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out.
>>>>     at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>>>>     at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>>>     at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>>>>     at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>>>>     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>     at java.lang.Thread.run(Thread.java:675)
>>>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>>>     at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>>>     at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>>>>     ... 7 more
>>>>
>>>>
>>>> If it were giving an error just one time, I could just rely on catching
>>>> the error and trying again. But a get_key_range call to the node I
>>>> was already making get_key_range queries against (node #1) never works
>>>> again (it is still up and it responds fine to multiget Thrift calls),
>>>> sometimes not even after I restart the downed node (node #5). I end up
>>>> having to restart node #1 in addition to node #5. The behavior for
>>>> the other 3 nodes varies - some of them are also unable to respond to
>>>> get_key_range calls, but some of them do respond to get_key_range
>>>> calls.
>>>>
>>>> My question is, what path should I go down in terms of reproducing
>>>> this problem? I'm using Aug 27 trunk code - should I update my
>>>> Cassandra install prior to gathering more information for this issue,
>>>> and if so, which version (0.4 or trunk)? If there is anyone who is
>>>> familiar with this issue, could you let me know what I might be doing
>>>> wrong, or what the next info-gathering step should be for me?
>>>>
>>>> Thank you,
>>>>
>>>> Simon Smith
>>>> Arcode Corporation
>>>>
>>>
>>
>
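P.S. For concreteness, the per-node test loop I mentioned is essentially the following (re-sketched here in Java against the Thrift interface; the get_key_range signature is from memory of the 0.4-era cassandra.thrift, and "Keyspace1"/"Standard1" are placeholders for whatever storage-conf.xml defines, so adjust as needed):

// Sketch of the per-node test loop: hit get_key_range on localhost once a
// second and print any failure.  The get_key_range signature and the
// "Keyspace1"/"Standard1" names are assumptions - check them against your
// own cassandra.thrift and storage-conf.xml.
import java.util.List;

import org.apache.cassandra.service.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class GetKeyRangeLoop {
    public static void main(String[] args) throws Exception {
        while (true) {
            TTransport transport = new TSocket("localhost", 9160);
            try {
                transport.open();
                Cassandra.Client client =
                        new Cassandra.Client(new TBinaryProtocol(transport));
                List<String> keys =
                        client.get_key_range("Keyspace1", "Standard1", "", "", 100);
                System.out.println("ok: " + keys.size() + " keys");
            } catch (Exception e) {
                // Once node #5 is killed, this is where the
                // "Internal error processing get_key_range" failure surfaces.
                System.out.println("get_key_range failed: " + e);
            } finally {
                transport.close();
            }
            Thread.sleep(1000);
        }
    }
}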
