OK, I believe the problem was that when I was upgrading to a newer build of Cassandra, I upgraded the servers one by one by restarting them. So at one point some nodes were running a build about 2 days older than the others, and that seems to have caused the inter-node messaging to go haywire.
I stopped all the nodes at the same time and restarted all of them, and the problem seems to be fixed.

Cheers
Ramzi

On Thu, Dec 17, 2009 at 8:55 AM, Ramzi Rabah <[email protected]> wrote:
> I added some debugging code to capture the time a read takes
> (getColumnFamily) and the time the round trip weakRemoteRead takes.
> The time it takes to read columns is negligible, so it doesn't seem a
> problem with getColumnFamily. The time it takes for weakRemoteRead,
> however, is > 5 seconds in some cases. Looking at some more debugging
> output, the log indicates that the packets are in the process of being
> sent by weakRemoteRead to the correct target node, but for some reason
> the target node has no reference in its log of handling the get at all.
>
> A couple of other things to note:
> 1- I restarted the nodes one after another while there was traffic
> going to them. I don't know whether that would throw off Cassandra or
> whether the whole thing is a network congestion problem.
> 2- Read stats at the keyspace level show a NaN value for read latency,
> which seems like a bug.
>
> Thanks
> Ramzi
>
> On Wed, Dec 16, 2009 at 12:07 PM, Jonathan Ellis <[email protected]> wrote:
>> On Wed, Dec 16, 2009 at 12:46 PM, Ramzi Rabah <[email protected]> wrote:
>>> We are observing an increasing number of TimedOutExceptions in
>>> Cassandra 0.5 trunk, although the load seems fairly low (about 400
>>> reads/writes per second).
>>> cfstats reports that operations are taking less than 2 ms on average.
>>>
>>> Two things I have noticed looking at the source code:
>>>
>>> 1- TimedOutExceptions are silently swallowed by Cassandra and not
>>> reported in the logs, even at debug level.
>>
>> It's reported to the client. Hardly "swallowed" :)
>>
>>> 2- readstats does not account for these long-running queries that
>>> time out.
>>
>> Right. But the CF-level stats do.
>>
>>> I'm wondering, what could be causing the system to go haywire like
>>> this?
>>
>> Hard to say without more information. One shot in the dark is that
>> get_key_range is a major offender sometimes, as well as workloads that
>> do lots of deletes + re-inserts for the same keys.
>>
>> -Jonathan
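
For anyone who wants to reproduce the kind of measurement described in the quoted thread, here is a minimal sketch of per-read timing on the client side. It is illustrative only: the class name TimedRead, the label parameter, and the 5-second threshold are assumptions chosen to mirror the stalls mentioned above, and the actual read call (for example a Thrift get_slice) would be supplied by the caller as a Callable, since the exact client API depends on the Cassandra version in use.

    import java.util.concurrent.Callable;

    // Hypothetical helper in the spirit of the debugging code described above:
    // time a single read and log it when it crosses a threshold, so slow
    // round trips (like the > 5 second weakRemoteRead calls) show up explicitly.
    public final class TimedRead
    {
        // Reads slower than this get logged; 5 seconds matches the stalls above.
        private static final long SLOW_READ_MS = 5000;

        private TimedRead() {}

        public static <T> T timed(String label, Callable<T> read) throws Exception
        {
            long start = System.currentTimeMillis();
            try
            {
                return read.call();
            }
            finally
            {
                long elapsed = System.currentTimeMillis() - start;
                if (elapsed >= SLOW_READ_MS)
                    System.err.println(label + " took " + elapsed + " ms");
            }
        }
    }

A wrapper like this is also a natural place to catch and count the TimedOutException that, as noted above, is reported back to the client rather than logged on the server.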
