I think it might take quite a bit of effort for me to figure out how to use a Java debugger - it will be a lot quicker if you can give me a patch; then I can certainly re-build using ant against either the latest trunk or the latest 0.4 and re-run my test.
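In the meantime, my (possibly wrong) mental model of what findSuitableEndPoint should be doing, based on your description quoted below, is just "skip the replicas the failure detector considers dead." Roughly this sketch - not the actual Cassandra code, just the behaviour I'm expecting:

// Not Cassandra's actual code - just the behaviour I'm expecting from
// findSuitableEndPoint: given the replica endpoints for a range, skip the
// ones the failure detector considers dead and return the first live one.
import java.util.Arrays;
import java.util.List;

public class EndpointPickerSketch {
    /** Stand-in for the failure detector. */
    interface Liveness {
        boolean isAlive(String endpoint);
    }

    static String findSuitableEndpoint(List<String> replicas, Liveness fd) {
        for (String ep : replicas) {
            if (fd.isAlive(ep)) {
                return ep;          // first live replica wins
            }
        }
        return null;                // no live replica for this range
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("node5", "node1", "node2");
        Liveness fd = new Liveness() {
            public boolean isAlive(String ep) {
                return !ep.equals("node5");   // pretend node #5 is down
            }
        };
        // Expected: node1 is chosen, not the dead node5
        System.out.println(findSuitableEndpoint(replicas, fd));
    }
}
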
Thanks,

Simon

On Wed, Sep 9, 2009 at 6:52 PM, Jonathan Ellis <[email protected]> wrote:
> Okay, so when #5 comes back up, #1 eventually stops erroring out and
> you don't have to restart #1? That is good, that would have been a
> bigger problem. :)
>
> If you are comfortable using a Java debugger (by default Cassandra
> listens for one on 8888) you can look at what is going on inside
> StorageProxy.getKeyRange on node #1 at the call to
>
>     EndPoint endPoint =
>         StorageService.instance().findSuitableEndPoint(command.startWith);
>
> findSuitableEndpoint is supposed to pick a live node, not a dead one. :)
>
> If not, I can write a patch to log extra information for this bug so we
> can track it down.
>
> -Jonathan
>
> On Wed, Sep 9, 2009 at 5:43 PM, Simon Smith <[email protected]> wrote:
>> The error starts as soon as the downed node #5 goes down and lasts
>> until I restart the downed node #5.
>>
>> bin/nodeprobe cluster is accurate (it knows quickly when #5 is down,
>> and when it is up again).
>>
>> Since I set the replication factor to 3, I'm confused as to why (after
>> the first few seconds or so) there is an error just because one host
>> is down temporarily.
>>
>> The way I have the test set up is that I have a script running on each
>> of the nodes that calls get_key_range over and over against
>> "localhost". Depending on which node I take down, the behavior
>> varies: if I take down one host, it is the only one giving errors (the
>> other 4 nodes still work). For the other 4 situations, either 2 or 3
>> nodes continue to work (i.e. the downed node and either one or two
>> other nodes are the ones giving errors). Note: the nodes that keep
>> working never fail at all, not even for a few seconds.
>>
>> I am running this on 4GB "cloud server" boxes in Rackspace; I can set
>> up just about any test needed to help debug this and capture output or
>> logs, and can give a Cassandra developer access if it would help. Of
>> course I can include whatever config files or log files would be
>> helpful, I just don't want to spam the list unless it is relevant.
>>
>> Thanks again,
>>
>> Simon
>>
>>
>> On Tue, Sep 8, 2009 at 6:26 PM, Jonathan Ellis <[email protected]> wrote:
>>> getting temporary errors when a node goes down, until the other nodes'
>>> failure detectors realize it's down, is normal. (this should only
>>> take a dozen seconds, or so.)
>>>
>>> but after that it should route requests to other nodes, and it should
>>> also realize when you restart #5 that it is alive again. those are
>>> two separate issues.
>>>
>>> can you verify that "bin/nodeprobe cluster" shows that node 1
>>> eventually does/does not see #5 dead, and alive again?
>>>
>>> -Jonathan
>>>
>>> On Tue, Sep 8, 2009 at 5:05 PM, Simon Smith <[email protected]> wrote:
>>>> I'm seeing an issue similar to:
>>>>
>>>> http://issues.apache.org/jira/browse/CASSANDRA-169
>>>>
>>>> Here is when I see it. I'm running Cassandra on 5 nodes using the
>>>> OrderPreservingPartitioner, and have populated Cassandra with 78
>>>> records, and I can use get_key_range via Thrift just fine. Then, if I
>>>> manually kill one of the nodes (if I kill off node #5), the node (node
>>>> #1) which I've been using to call get_key_range will time out with the
>>>> error:
>>>>
>>>> Thrift: Internal error processing get_key_range
>>>>
>>>> And the Cassandra output shows the same trace as in 169:
>>>>
>>>> ERROR - Encountered IOException on connection: java.nio.channels.SocketChannel[closed]
>>>> java.net.ConnectException: Connection refused
>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>>>>     at org.apache.cassandra.net.TcpConnection.connect(TcpConnection.java:349)
>>>>     at org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:131)
>>>>     at org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:98)
>>>> WARN - Closing down connection java.nio.channels.SocketChannel[closed]
>>>> ERROR - Internal error processing get_key_range
>>>> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out.
>>>>     at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:573)
>>>>     at org.apache.cassandra.service.CassandraServer.get_key_range(CassandraServer.java:595)
>>>>     at org.apache.cassandra.service.Cassandra$Processor$get_key_range.process(Cassandra.java:853)
>>>>     at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:606)
>>>>     at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>>     at java.lang.Thread.run(Thread.java:675)
>>>> Caused by: java.util.concurrent.TimeoutException: Operation timed out.
>>>>     at org.apache.cassandra.net.AsyncResult.get(AsyncResult.java:97)
>>>>     at org.apache.cassandra.service.StorageProxy.getKeyRange(StorageProxy.java:569)
>>>>     ... 7 more
>>>>
>>>>
>>>> If it were giving an error just one time, I could just rely on catching
>>>> the error and trying again. But a get_key_range call to the node I
>>>> was already making get_key_range queries against (node #1) never works
>>>> again (it is still up and it responds fine to multiget Thrift calls),
>>>> sometimes not even after I restart the downed node (node #5). I end up
>>>> having to restart node #1 in addition to node #5. The behavior for
>>>> the other 3 nodes varies - some of them are also unable to respond to
>>>> get_key_range calls, but some of them do respond to get_key_range
>>>> calls.
>>>>
>>>> My question is, what path should I go down in terms of reproducing
>>>> this problem? I'm using Aug 27 trunk code - should I update my
>>>> Cassandra install prior to gathering more information for this issue,
>>>> and if so, which version (0.4 or trunk)? If there is anyone who is
>>>> familiar with this issue, could you let me know what I might be doing
>>>> wrong, or what the next info-gathering step should be for me?
>>>>
>>>> Thank you,
>>>>
>>>> Simon Smith
>>>> Arcode Corporation
>>>>
>>>
>>
>
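P.S. For concreteness, the per-node test loop I mentioned is essentially the following (re-sketched here in Java against the Thrift interface; the get_key_range signature is from memory of the 0.4-era cassandra.thrift, and "Keyspace1"/"Standard1" are placeholders for whatever storage-conf.xml defines, so adjust as needed):

// Sketch of the per-node test loop: hit get_key_range on localhost once a
// second and print any failure.  The get_key_range signature and the
// "Keyspace1"/"Standard1" names are assumptions - check them against your
// own cassandra.thrift and storage-conf.xml.
import java.util.List;

import org.apache.cassandra.service.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class GetKeyRangeLoop {
    public static void main(String[] args) throws Exception {
        while (true) {
            TTransport transport = new TSocket("localhost", 9160);
            try {
                transport.open();
                Cassandra.Client client =
                        new Cassandra.Client(new TBinaryProtocol(transport));
                List<String> keys =
                        client.get_key_range("Keyspace1", "Standard1", "", "", 100);
                System.out.println("ok: " + keys.size() + " keys");
            } catch (Exception e) {
                // Once node #5 is killed, this is where the
                // "Internal error processing get_key_range" failure surfaces.
                System.out.println("get_key_range failed: " + e);
            } finally {
                transport.close();
            }
            Thread.sleep(1000);
        }
    }
}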
