OK, I believe the problem was that when I was upgrading to a newer build of Cassandra, I upgraded the servers one by one by restarting them. So at one point some nodes were running a build about 2 days older than the others, and that seems to have caused the inter-node messaging to go haywire.
I stopped all the nodes at the same time and restarted all of them, and the problem seems to be fixed.

Cheers
Ramzi

On Thu, Dec 17, 2009 at 8:55 AM, Ramzi Rabah <[email protected]> wrote:
> I added some debugging code to capture the time a read takes
> (getColumnFamily) and the time the round trip weakRemoteRead takes.
> The time it takes to read columns is negligible, so it doesn't seem a
> problem with getColumnFamily. The time it takes for weakRemoteRead,
> however, is > 5 seconds in some cases. Looking at some more debugging
> output, the log indicates that the packets are in the process of being
> sent by weakRemoteRead to the correct target node, but for some reason
> the target node has no reference in its log of handling the get at all.
>
> A couple of other things to note:
> 1- I restarted the nodes one after another while there was traffic
> going to them. I don't know whether that would throw off Cassandra or
> whether the whole thing is a network congestion problem.
> 2- Read stats at the keyspace level show a NaN value for read latency,
> which seems like a bug.
>
> Thanks
> Ramzi
>
> On Wed, Dec 16, 2009 at 12:07 PM, Jonathan Ellis <[email protected]> wrote:
>> On Wed, Dec 16, 2009 at 12:46 PM, Ramzi Rabah <[email protected]> wrote:
>>> We are observing an increasing number of TimedOutExceptions in
>>> Cassandra 0.5 trunk, although the load seems fairly low (about 400
>>> reads/writes per second).
>>> cfstats reports that operations are taking less than 2 ms on average.
>>>
>>> Two things I have noticed looking at the source code:
>>>
>>> 1- TimedOutExceptions are silently swallowed by Cassandra and not
>>> reported in the logs, even at debug level.
>>
>> It's reported to the client. Hardly "swallowed" :)
>>
>>> 2- readstats does not account for these long-running queries that
>>> time out.
>>
>> Right. But the CF-level stats do.
>>
>>> I'm wondering, what could be causing the system to go haywire like
>>> this?
>>
>> Hard to say without more information. One shot in the dark is that
>> get_key_range is a major offender sometimes, as well as workloads that
>> do lots of deletes + re-inserts for the same keys.
>>
>> -Jonathan
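
For anyone who wants to reproduce the kind of measurement described in the quoted thread, here is a minimal sketch of per-read timing on the client side. It is illustrative only: the class name TimedRead, the label parameter, and the 5-second threshold are assumptions chosen to mirror the stalls mentioned above, and the actual read call (for example a Thrift get_slice) would be supplied by the caller as a Callable, since the exact client API depends on the Cassandra version in use.

    import java.util.concurrent.Callable;

    // Hypothetical helper in the spirit of the debugging code described above:
    // time a single read and log it when it crosses a threshold, so slow
    // round trips (like the > 5 second weakRemoteRead calls) show up explicitly.
    public final class TimedRead
    {
        // Reads slower than this get logged; 5 seconds matches the stalls above.
        private static final long SLOW_READ_MS = 5000;

        private TimedRead() {}

        public static <T> T timed(String label, Callable<T> read) throws Exception
        {
            long start = System.currentTimeMillis();
            try
            {
                return read.call();
            }
            finally
            {
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed >= SLOW_READ_MS)
                    System.err.println(label + " took " + elapsed + " ms");
            }
        }
    }

A wrapper like this is also a natural place to catch and count the TimedOutException that, as noted above, is reported back to the client rather than logged on the server.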
