I added some debugging code to capture the time a read (getColumnFamily) takes and the time the weakRemoteRead round trip takes. The time it takes to read the columns is negligible, so getColumnFamily does not seem to be the problem. The weakRemoteRead round trip, however, takes more than 5 seconds in some cases. Looking at more debugging output, the log shows weakRemoteRead in the process of sending the packets to the correct target node, but for some reason the target node's log has no record that it handled the get at all.
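
A minimal sketch of this kind of timing instrumentation, for anyone who wants to reproduce the measurement. It is a standalone illustration, not the actual debugging patch: ReadTimer, Timed, and the 5000 ms threshold are hypothetical, and the lambdas in main are stand-ins for the real getColumnFamily and weakRemoteRead calls.

```java
import java.util.function.Supplier;

// Hypothetical helper: wraps a call, measures its wall-clock time,
// and logs it when it exceeds a threshold.
class ReadTimer {

    /** Value returned by the timed call plus the elapsed time. */
    static final class Timed<T> {
        final T value;
        final long elapsedNanos;

        Timed(T value, long elapsedNanos) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
        }

        long elapsedMillis() {
            return elapsedNanos / 1_000_000L;
        }
    }

    /**
     * Runs op, measures it with the monotonic System.nanoTime() clock,
     * and prints a line to stderr if it took at least slowThresholdMillis.
     */
    static <T> Timed<T> time(String label, long slowThresholdMillis, Supplier<T> op) {
        long start = System.nanoTime();
        T value = op.get();
        long elapsed = System.nanoTime() - start;

        Timed<T> timed = new Timed<>(value, elapsed);
        if (timed.elapsedMillis() >= slowThresholdMillis) {
            System.err.println("SLOW " + label + ": " + timed.elapsedMillis() + " ms");
        }
        return timed;
    }

    public static void main(String[] args) {
        // Placeholders: replace the lambdas with the real local read
        // and the real remote round trip.
        Timed<String> local = time("getColumnFamily", 5000, () -> "columns");
        Timed<String> remote = time("weakRemoteRead", 5000, () -> "row");

        System.out.println("getColumnFamily: " + local.elapsedMillis() + " ms");
        System.out.println("weakRemoteRead:  " + remote.elapsedMillis() + " ms");
    }
}
```

Timing both calls with the same helper makes it easy to see whether the delay is in the local column read or in the remote round trip.
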
A couple of other things to note:

1- I restarted the nodes one after another while there was traffic going to them. I don't know whether that would throw off Cassandra, or whether the whole thing is a network congestion problem.
2- Read stats at the keyspace level report NaN for read latency, which seems like a bug?

Thanks
Ramzi

On Wed, Dec 16, 2009 at 12:07 PM, Jonathan Ellis <[email protected]> wrote:
> On Wed, Dec 16, 2009 at 12:46 PM, Ramzi Rabah <[email protected]> wrote:
>> We are observing increasing number of TimedOutExceptions in cassandra
>> 0.5 trunk although the load seems fairly low (about 400 reads/writes
>> per second).
>> cfstats reports that operations are taking less than 2 ms on average.
>>
>> 2 Things I have noticed looking at the source code.
>>
>> 1- TimedOutExceptions are silently swallowed by Cassandra and not
>> reported in the logs even at debug level
>
> It's reported to the client. Hardly "swallowed" :)
>
>> 2- readstats does not account for these long time running queries that
>> time out.
>
> Right. But the CF-level stats do.
>
>> I'm wondering, what could be causing the system to go haywire like
>> this?
>
> Hard to say without more information. One shot in the dark is that
> get_key_range is a major offender sometimes, as well as workloads that
> do lots of deletes + re-inserts for the same keys.
>
> -Jonathan
>
