Once a node goes over 100% load, it never comes back down, and it stops
doing any useful work. Presumably this is because the connections on
which it might receive a response to one of its queries have either gone
down or are saturated with QueryRejects.
Possible solutions?
My suggestion: if the threadlimit is exceeded, restart the node, error
out, or start killing random threads. Increase the threadlimit to the
maximum that the JVM can handle (with green threads, as used e.g. in
Kaffe, this is a lot). Use something else to determine load and to do
request triage.

So if threads used > 80% of threadlimit OR
   average tickerdelay for last minute > 3000ms,
   then QR (QueryReject) all new requests (except those in failtable)

If threads used > 70% of threadlimit OR
   average tickerdelay > 1000ms,
   then QR all requests except those in datastore

If threads used > 60% of threadlimit OR
   average tickerdelay > 500ms,
   then only accept requests in the most successful part of the keyspace.
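
To make the order of the checks concrete, here is a rough sketch of the
three tiers above in Java. It is only an illustration: the class, enum
and parameter names (LoadTriage, Policy, avgTickerDelayMs) are made up
for this mail and are not taken from the node's source. The strictest
tier is tested first so that a heavily loaded node never falls through
to a more permissive rule.

```java
// Hypothetical sketch; none of these names exist in the actual node code.
final class LoadTriage {
    enum Policy { ACCEPT_ALL, BEST_KEYSPACE_ONLY, DATASTORE_ONLY, REJECT_NEW }

    // threadsUsed / threadLimit are compared as percentages using integer
    // arithmetic; avgTickerDelayMs is the averaged ticker delay in ms.
    static Policy decide(int threadsUsed, int threadLimit, long avgTickerDelayMs) {
        long used = (long) threadsUsed * 100;
        long limit = threadLimit;
        if (used > limit * 80 || avgTickerDelayMs > 3000) {
            return Policy.REJECT_NEW;          // QR everything except failtable hits
        }
        if (used > limit * 70 || avgTickerDelayMs > 1000) {
            return Policy.DATASTORE_ONLY;      // QR unless the key is in the datastore
        }
        if (used > limit * 60 || avgTickerDelayMs > 500) {
            return Policy.BEST_KEYSPACE_ONLY;  // accept only the most successful keyspace
        }
        return Policy.ACCEPT_ALL;
    }
}
```

The caller would still have to check the failtable, the datastore or the
keyspace statistics itself; the sketch only picks which rule applies.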

GJ: PLEASE commit your Psmin patch; we need to get rid of the
most-successful-part-of-the-keyspace hack.

Then we would have:

If threads used > 80% of threadlimit OR
   average tickerdelay for last minute > 3000ms,
   then QR all new requests (except those in failtable)

If threads used > 60% of threadlimit OR
   average tickerdelay > 500ms,
   then QR all requests except those where Ps(k) > Psmin
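
As a rough sketch, the two-tier version could look like this in Java.
Again the names (PsminTriage, shouldReject, ps, psMin) are invented for
illustration, and the sketch assumes Ps(k) has already been computed for
the requested key. It follows the rules literally, so in the lower tier
a failtable hit gets no special treatment.

```java
// Hypothetical sketch; names and signature are assumptions, not node code.
final class PsminTriage {
    // Returns true if the incoming request should get a QueryReject.
    static boolean shouldReject(int threadsUsed, int threadLimit,
                                long avgTickerDelayMs,
                                boolean inFailTable, double ps, double psMin) {
        long used = (long) threadsUsed * 100;
        long limit = threadLimit;

        // Upper tier: > 80% of threadlimit or > 3000ms ticker delay.
        if (used > limit * 80 || avgTickerDelayMs > 3000) {
            return !inFailTable;   // only failtable hits get through
        }
        // Lower tier: > 60% of threadlimit or > 500ms ticker delay.
        if (used > limit * 60 || avgTickerDelayMs > 500) {
            return !(ps > psMin);  // only keys with Ps(k) > Psmin get through
        }
        return false;              // not overloaded: accept
    }
}
```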

I am still looking for the relevant diagnostics code...
And I have no idea how to get the maximum number of threads for a given
JVM.
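
For what it is worth, as far as I know the standard Java library has no
call that reports the maximum number of threads a JVM can create. On a
JVM that ships java.lang.management (Java 5 or later, which is an
assumption about the runtime), the live thread count can at least be
read and compared against a configured threadlimit. A minimal sketch,
with ThreadUsage and percentOfLimit being made-up names:

```java
// Hypothetical helper; only the java.lang.management calls are real API.
final class ThreadUsage {
    // Number of live threads in this JVM, daemon and non-daemon.
    static int liveThreads() {
        return java.lang.management.ManagementFactory
                .getThreadMXBean().getThreadCount();
    }

    // Live threads as a whole percentage of a configured threadlimit.
    static int percentOfLimit(int live, int threadLimit) {
        return (int) ((long) live * 100 / threadLimit);
    }
}
```

The threadlimit itself would still have to come from configuration or
from experiment on the target JVM.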
