Once a node goes over 100% load, it never comes back down again, and it stops doing any more work. Presumably this is because the connections on which it might get a response to one of its queries have either gone down or are saturated with QueryRejects.

Possible solutions? My suggestions:

- If the threadlimit is exceeded, restart the node, error out, or start killing random threads.
- Increase the threadlimit to the maximum that the JVM can handle (with green threads, as used in e.g. Kaffe, this is a lot).
- Use something other than the current load figure to determine load and do request triage.
So:

- If threads used > 80% of threadlimit OR average tickerdelay for the last minute > 3000ms, then QR all new requests (except those in the failtable).
- If threads used > 70% of threadlimit OR average tickerdelay > 1000ms, then QR all requests except those in the datastore.
- If threads used > 60% of threadlimit OR average tickerdelay > 500ms, then only accept requests in the most successful part of the keyspace.

GJ: PLEASE commit your Psmin patch, we need to get rid of the most-successful-part-of-the-keyspace hack. Then we would have:

- If threads used > 80% of threadlimit OR average tickerdelay for the last minute > 3000ms, then QR all new requests (except those in the failtable).
- If threads used > 60% of threadlimit OR average tickerdelay > 500ms, then QR all requests except those where Ps(k) > Psmin.

I am still looking for the relevant diagnostics code... And I have no idea how to get the maximum number of threads for a given JVM.
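To make the two-tier Psmin version concrete, here is a rough Java sketch of how the decision could look. This is not code from the tree: the names LoadTriage, TickerDelayWindow and shouldReject are made up for illustration, and I am assuming the caller can supply the thread count, whether the key is in the failtable, and Ps(k). It follows the wording above literally, so at the 60% tier a failtable hit gets no special treatment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the proposed two-tier triage; none of these
// names exist in the node's code.
final class LoadTriage {

    /** Rolling average of tickerdelay samples over a time window. */
    static final class TickerDelayWindow {
        private final long windowMillis;
        // Each entry is {timestampMillis, delayMillis}, oldest first.
        private final Deque<long[]> samples = new ArrayDeque<>();
        private long sum;

        TickerDelayWindow(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        void record(long nowMillis, long delayMillis) {
            samples.addLast(new long[] { nowMillis, delayMillis });
            sum += delayMillis;
            evict(nowMillis);
        }

        double average(long nowMillis) {
            evict(nowMillis);
            return samples.isEmpty() ? 0.0 : (double) sum / samples.size();
        }

        // Drop samples that are older than the window.
        private void evict(long nowMillis) {
            while (!samples.isEmpty()
                    && nowMillis - samples.peekFirst()[0] > windowMillis) {
                sum -= samples.removeFirst()[1];
            }
        }
    }

    /** Returns true if the request should be answered with a QueryReject. */
    static boolean shouldReject(int threadsUsed, int threadLimit,
                                double avgTickerDelayMs, boolean inFailTable,
                                double ps, double psMin) {
        if (threadLimit <= 0) {
            throw new IllegalArgumentException("threadLimit must be positive");
        }
        double threadFraction = (double) threadsUsed / threadLimit;

        // Tier 1: heavy load, QR everything except failtable hits.
        if (threadFraction > 0.80 || avgTickerDelayMs > 3000) {
            return !inFailTable;
        }
        // Tier 2: moderate load, QR everything except keys with Ps(k) > Psmin.
        if (threadFraction > 0.60 || avgTickerDelayMs > 500) {
            return !(ps > psMin);
        }
        // Below both thresholds: accept.
        return false;
    }
}
```

The window would be created with 60000 ms to match "average tickerdelay for the last minute", fed one sample per tick, and its average passed to shouldReject for each incoming request. Whether the 500ms threshold should also use the one-minute average is not stated above; the sketch assumes it does.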
