Brian Goetz said the following on 12/06/10 06:22:
Its worth noting that the real performance offender here is that thread states are enums (and therefore object references). If they were integers, then the entire thing could most likely be done in a way that is guaranteed branch-free and dcache-miss-free.

Agree with Eamonn that the current sequence-of-tests is probably good enough. Might be worth doing a little profiling to choose optimal order of tests; I would guess that the right order is RUNNABLE, WAITING, TIMED_WAITING, BLOCKED, NEW, TERMINATED, but that's just an OOMH (out-of-my-hat) guess.

I had already had this discussion with Mandy, including the idea that a lookup table would be faster - if there were a simple way to construct it. The current order of tests is based on our discussion. No doubt Runnable should be first, and new/terminated last. The rest of somewhat subjective. As sync is used more than wait/notify then BLOCKED would be next most likely. For waiting vs timed-waiting it depends on how defensively people code their waits :)

Anyway, this was all premised on slow performance that Doug observed as part of ForkJoin mechanics. It would be good to hear back from him on how this updated approach performs. If the FJ code has changed such that this is no longer an issue then I would suggest that Mandy's changes are "good enough" and we let her move on.

Cheers,
David

 There are probably not enough bits here to
justify a binary search (which trades off best-case against average/worst-case time, but might turn up a combination of bits that has better branch prediction characteristics.)


On 12/5/2010 2:00 PM, Eamonn McManus wrote:
Yeah, it was a bit blithe of me to write that the sequence of tests was
faster. In the table-lookup version, if you get rid of the initial test for RUNNABLE, and if you use Integer.numberOfLeadingZeros, and if the JIT compiler intrinsifies that to a native processor instruction, and if the lookup table is in L1 cache, then the table-lookup version will run in constant time and be
better than the worst case of the sequence-of-tests version, and probably
better than the average case too. But, as you say, that last /if/ (the cache hit) will usually not be true, and in that case I would not be surprised if
the sequence of tests were faster even in its worst case.

Anyway the sequence-of-tests version is unquestionably simpler, and I would venture that the best solution is probably to go with that, plus a new method in the API that explicitly tests whether a thread is runnable. That's trivial to implement now that Mandy has pulled the knowledge of state bits into the
Java code rather than being hidden in the bowels of the VM; and its
implementation will be faster than (Thread.getState() == RUNNABLE) regardless
of the implementation of the latter.

Éamonn


On 5/12/10 8:27 AM, Brian Goetz wrote:
As Eamonn writes it, it will never cache miss but may frequently branch
mispredict (possibly multiple times). If you do a shift + mask + index into a small table, it will cache miss most the time but never branch mispredict. (In a real program it will cache miss frequently since thread state calls
are infrequent and the lookup table will fall out of cache; in a
microbenchmark it will almost never cache miss as the lookup table will be
hot.)

On 12/4/2010 7:22 AM, Eamonn McManus wrote:
Hi Mandy,

This test:

         if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE) == 1) {

is always false, since JVMTI_THREAD_STATE_RUNNABLE is 4. (NetBeans 7.0
helpfully flags this; I'm not sure if earlier versions do.)

But, once corrected, I think you could use this idea further to write a much
simpler and faster method, on these lines:

     public static Thread.State toThreadState(int threadStatus) {
         if ((threadStatus&  JVMTI_THREAD_STATE_RUNNABLE)*!= 0*) {
             return RUNNABLE;
         } else if ((threadStatus&
JVMTI_THREAD_STATE_BLOCKED_ON_MONITOR_ENTER) != 0) {
             return BLOCKED;
         } else if ((threadStatus&
JVMTI_THREAD_STATE_WAITING_WITH_TIMEOUT) != 0) {
             return TIMED_WAITING;
         } else if ((threadStatus&
JVMTI_THREAD_STATE_WAITING_INDEFINITELY) != 0) {
             return WAITING;
} else if ((threadStatus& JVMTI_THREAD_STATE_TERMINATED) != 0) {
             return TERMINATED;
         } else {
             return NEW;
         }
     }

You could tweak the order of the tests based on what might be the relative
frequency of the different states but it probably isn't worth it.

Regards,

Éamonn


On 3/12/10 11:52 PM, Mandy Chung wrote:
Fix for 6977034: Thread.getState() very slow

Webrev at:
http://cr.openjdk.java.net/~mchung/6977034/webrev.00/

This is an improvement to map a Thread's threadStatus field to Thread.State. The VM updates the Thread.threadStatus field directly at state transition with the value as defined in JVM TI [1]. The java.lang.Thread.getState() implementation can directly access the threadStatus value and do a direct lookup from an array of Thread.State. The threadStatus value is a bit vector and we would have to create an array of a minimum of 1061 (0x425) elements to do direct mapping. I took the approach to use the first highest order bit set to 1 in the masked threadStatus value as the index to the Thread.State
element and only caches 32 elements (could be fewer). I wrote a
micro-benchmark measuring the Thread.getState of a thread in different state that shows 1.7X to 6X speedup (see below). There is possibly some issue with my micro-benchmark that I didn't observe the 14X speed up as Doug did in his
experiment. However, I'd like to get this reviewed and pushed to the
repository so that anyone can do more experiment on the performance
measurement.

Thanks
Mandy
P.S. The discussion on this thread can be found at [2] [3].

[1]
http://download.java.net/jdk7/docs/platform/jvmti/jvmti.html#GetThreadState [2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-July/004567.html
[3]
http://mail.openjdk.java.net/pipermail/core-libs-dev/2010-August/004721.html


    JDK 7 b120 (in ms)    With fix (in ms)    Speed up
main        46465            22772            2.04
NEW        50676        29921            1.69
RUNNABLE    42202        14690            2.87
BLOCKED        72773        12296            5.92
WAITING        48811        13041            3.74
TIMED_WAITING    45737        12849            3.56
TERMINATED    40314        16376            2.46

Reply via email to