Getting temporary errors when a node goes down, until the other nodes' failure detectors realize it is down, is normal. (This should only take a dozen seconds or so.)
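During that window, a client can simply fall back to another node instead of surfacing the timeout. Below is a minimal failover sketch; the RangeQuery interface, the node names, and the simulated failure are hypothetical stand-ins, since the exact get_key_range signature depends on which Cassandra/Thrift bindings you generated.

// Client-side failover sketch (not part of Cassandra itself): try each node in
// turn and move to the next one when a call fails or times out.
import org.apache.thrift.TException;
import java.util.Arrays;
import java.util.List;

public class RangeQueryFailover {

    /** Hypothetical wrapper around one get_key_range call against one node. */
    interface RangeQuery {
        List<String> run(String host) throws TException;
    }

    /** Try each node in order; return the first successful result, or rethrow the last error. */
    static List<String> withFailover(List<String> nodes, RangeQuery query) throws TException {
        TException last = null;
        for (String host : nodes) {
            try {
                return query.run(host);
            } catch (TException e) {
                last = e;  // e.g. "Internal error processing get_key_range" or a transport timeout
            }
        }
        throw last != null ? last : new TException("no nodes to query");
    }

    public static void main(String[] args) throws TException {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4", "node5");

        // Dummy query that simulates node5 being down; real code would open a
        // TSocket, build a Cassandra.Client, and call get_key_range here.
        RangeQuery demo = new RangeQuery() {
            public List<String> run(String host) throws TException {
                if (host.equals("node5")) {
                    throw new TException("simulated: " + host + " is down");
                }
                return Arrays.asList("key1", "key2");
            }
        };

        System.out.println(withFailover(nodes, demo));  // prints [key1, key2]
    }
}

This only papers over the detection window described above; if a node keeps failing get_key_range after the failure detector has caught up, failover alone will not explain it, which is the separate issue discussed next.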
But after that it should route requests to other nodes, and it should also realize when you restart #5 that it is alive again. Those are two separate issues. Can you verify whether "bin/nodeprobe cluster" shows that node 1 eventually does or does not see #5 as dead, and then as alive again?

-Jonathan

On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith <[email protected]> wrote:
> I'm seeing an issue similar to:
>
> http://issues.apache.org/jira/browse/CASSANDRA-169
>
> Here is when I see it. I'm running Cassandra on 5 nodes using the
> OrderPreservingPartitioner, and have populated Cassandra with 78
> records, and I can use get_key_range via Thrift just fine. Then, if I
> manually kill one of the nodes (node #5), the node (node #1) which
> I've been using to call get_key_range will time out and return the
> error:
>
> Thrift: Internal error processing get_key_range
>
> And the Cassandra output shows the same trace as in 169:
>
> ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed]
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>         at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>         at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>         at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
> ERROR - Internal error processing get_key_range
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out.
>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>         at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>         at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>         at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>         at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:675)
> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>         at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>         at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>         ... 7 more
>
> If it was giving an error just one time, I could just rely on catching
> the error and trying again. But a get_key_range call to the node I
> was already making get_key_range queries against (node #1) never works
> again (it is still up and it responds fine to multiget Thrift calls),
> sometimes not even after I restart the down node (node #5). I end up
> having to restart node #1 in addition to node #5. The behavior of
> the other 3 nodes varies: some of them are also unable to respond to
> get_key_range calls, but some of them do respond.
>
> My question is: what path should I go down in terms of reproducing
> this problem? I'm using Aug 27 trunk code - should I update my
> Cassandra install prior to gathering more information for this issue,
> and if so, which version (0.4 or trunk)? If there is anyone who is
> familiar with this issue, could you let me know what I might be doing
> wrong, or what the next info-gathering step should be for me?
>
> Thank you,
>
> Simon Smith
> Arcode Corporation
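Two quick ways to gather the data Jonathan asks for: run his "bin/nodeprobe cluster" suggestion against node 1 before killing #5, after killing it, and again after restarting it, and compare what node 1's failure detector reports each time; and, from the client side, check which nodes are actually accepting Thrift connections. The sketch below does only the latter. It probes each node's Thrift port with a short timeout, so it tells you what your client can reach, not what node 1's gossip/failure detector believes. The host names and the 9160 port are assumptions from the default configuration; use whatever your storage-conf.xml actually binds.

// Rough client-side reachability probe: open a Thrift connection to each
// node's client port with a short timeout and report which ones answer.
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransportException;

public class ProbeNodes {
    public static void main(String[] args) {
        String[] nodes = {"node1", "node2", "node3", "node4", "node5"};
        int thriftPort = 9160;  // default Thrift listen port in storage-conf.xml

        for (String host : nodes) {
            TSocket socket = new TSocket(host, thriftPort, 2000);  // 2-second timeout
            try {
                socket.open();
                System.out.println(host + ": reachable");
            } catch (TTransportException e) {
                System.out.println(host + ": NOT reachable (" + e.getMessage() + ")");
            } finally {
                if (socket.isOpen()) {
                    socket.close();
                }
            }
        }
    }
}

Running this while #5 is down and again after it restarts, and comparing the output with nodeprobe's view from node 1, should show whether node 1 never marks #5 alive again or whether the get_key_range path on node 1 stays wedged even though gossip has recovered.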
