[drlvm][jni][performance] JNI improvements

Aleksey Shipilev Thu, 27 Sep 2007 03:55:57 -0700

Hi all,

While the M3 code is stabilizing, I would like to propose a new
direction for optimization work in DRLVM for M4: JNI improvements. I
have run a couple of investigations and would like to share the results:


I. JNI transition

Currently, the JNI transition is generated as a LIL stub, and it is
clearly inefficient: look at the generated code and you will see
something like 70 instructions plus several calls accompanied by
push/pop pairs.

JNI transition performance was measured with a simple test that calls
a simple native method (no parameters, returns one integer) several
million times.

Baseline measurements for Harmony/DRLVM:

$ shade.0903.jni.0.clean/bin/java -Xmx128m -Xms128m nalog.nalog
iteration: 0 millis:8532
iteration: 1 millis:8572
iteration: 2 millis:8561

$ jdk1.6.0/bin/java -Xmx128m -Xms128m nalog.nalog
iteration: 0 millis:2560
iteration: 1 millis:2512
iteration: 2 millis:2542

The measurements were done on Windows XP/ia32, but the results are
reproducible on Linux/ia32 too. Note that Harmony is roughly 3.4x
slower than Sun 1.6.0.

Here is a list of presumably easy fixes that could help:

1. http://issues.apache.org/jira/browse/HARMONY-4705
First of all, note that the pointer to "thread self", which is needed
to obtain some required parameters, is fetched through calls to
get_vm_thread_ptr(), which in turn makes several calls before finally
reaching hythread_self(). This is due to the implementation of the LIL
code generator, which expands the "ts" mnemonic into this chain of
calls. The idea is simple and most likely safe: use the thread helper
to obtain TLS.

2. http://issues.apache.org/jira/browse/HARMONY-4714
Moving further, note that the exception check is performed twice:
once in the generated LIL stub and once more in pop_m2n. We could
introduce a new LIL mnemonic, pop_noexp_m2n, to solve that problem.
Meanwhile, we could inline m2n_(push|pop)_local_handles and get some
benefit from that. The drawback is obvious: we need to take care of
other platforms and implement the corresponding LIL code generator
parts for them.

3. http://issues.apache.org/jira/browse/HARMONY-4729
Then we come to the hythread_suspend_enable()/disable() pair. Each of
these operations takes TLS by itself again. We could again implement
two new LIL mnemonics, "hse" and "hsd", that generate equivalent code
in the LIL code generator, effectively inlining these functions and
using HARMONY-4705 to obtain TLS quickly. Again, this is supported
only on IA32 so far.
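
As a rough illustration of fixes 1 and 3, the sketch below contrasts a
transition where every suspend call looks up the thread pointer by
itself with one that fetches it once and reuses it. All names here
(sketch_thread_t, sketch_tls_lookup() and so on) are hypothetical
stand-ins, not DRLVM or LIL code, and the lookup counter exists only to
make the difference visible.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-thread structure (not HyThread). */
typedef struct {
    int disable_count;              /* > 0 means suspension is disabled */
} sketch_thread_t;

_Thread_local sketch_thread_t sketch_self;
_Thread_local int sketch_tls_lookups;   /* instrumentation only */

/* Models the expensive path: each call counts as one TLS lookup. */
sketch_thread_t *sketch_tls_lookup(void) {
    sketch_tls_lookups++;
    return &sketch_self;
}

/* Out-of-line style: each operation takes TLS by itself. */
void sketch_suspend_enable(void) {
    sketch_thread_t *t = sketch_tls_lookup();
    assert(t->disable_count > 0);   /* must be disabled on entry */
    t->disable_count--;
}
void sketch_suspend_disable(void) {
    sketch_tls_lookup()->disable_count++;
}

/* Inlined style: the caller passes the pointer it already holds. */
void sketch_suspend_enable_on(sketch_thread_t *t) {
    assert(t->disable_count > 0);
    t->disable_count--;
}
void sketch_suspend_disable_on(sketch_thread_t *t) {
    t->disable_count++;
}

typedef int (*sketch_native_fn)(void);

/* Trivial "native method": no parameters, returns one integer. */
int sketch_native_42(void) { return 42; }

/* Put the thread into the state managed code runs in. */
void sketch_reset(void) {
    sketch_self.disable_count = 1;
    sketch_tls_lookups = 0;
}

/* Transition that pays one lookup per suspend operation. */
int sketch_transition_naive(sketch_native_fn fn) {
    sketch_suspend_enable();
    int result = fn();
    sketch_suspend_disable();
    return result;
}

/* Transition that takes TLS once and reuses the pointer. */
int sketch_transition_cached(sketch_native_fn fn) {
    sketch_thread_t *t = sketch_tls_lookup();
    sketch_suspend_enable_on(t);
    int result = fn();
    sketch_suspend_disable_on(t);
    return result;
}
```

After sketch_reset(), one call to sketch_transition_naive() records two
lookups, while sketch_transition_cached() records one. The real saving
depends on how expensive the lookup chain is on a given platform.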

Taking all this together:

$ shade.0903.jni.4.cumulative/bin/java -Xmx128m -Xms128m nalog.nalog
iteration: 0 millis:5477
iteration: 1 millis:5528
iteration: 2 millis:5528

...we have up to a +55% boost on the microtest and, most interestingly,
a +14% boost on Dacapo:jython, +10% on Dacapo:pmd and so on. But
Harmony is still more than 2.1x slower than Sun 1.6.0. Of course, even
these "fixes" are not enough, and the generated code still looks ugly
(but better :)).
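
For anyone who wants to recheck the ratios, here is a small sketch that
recomputes them from the averaged timings of the runs quoted above. The
helper names (mean_ms, slowdown) are mine and purely illustrative; only
the millisecond values come from the measurements.

```c
#include <assert.h>
#include <stddef.h>

/* Timings in milliseconds, copied from the runs quoted above. */
const int baseline_ms[3]   = {8532, 8572, 8561};  /* Harmony, clean      */
const int cumulative_ms[3] = {5477, 5528, 5528};  /* Harmony, all fixes  */
const int sun_ms[3]        = {2560, 2512, 2542};  /* Sun 1.6.0           */

/* Arithmetic mean of n timings; n must be positive. */
double mean_ms(const int *samples, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += samples[i];
    }
    return sum / (double)n;
}

/* How many times slower `slow` is than `fast`, comparing mean times. */
double slowdown(const int *slow, const int *fast, size_t n) {
    return mean_ms(slow, n) / mean_ms(fast, n);
}
```

slowdown(baseline_ms, cumulative_ms, 3) gives the speedup from the
three fixes, and slowdown(cumulative_ms, sun_ms, 3) gives the remaining
gap to Sun 1.6.0.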

These measurements show that thorough optimization of the JNI
transition could help us on many workloads: think of client workloads
with Swing, server workloads with networking and so on.

There are a couple of questions and open issues:

1. Do we really need LIL to generate JNI transition?

1.1. Would it be better to have encoder-based versions of the JNI
transition for each platform rather than extending the LIL code
generator with new mnemonics? To answer that question, I have
rewritten the JNI transition on the encoder:
https://issues.apache.org/jira/browse/HARMONY-4806 (I haven't
managed to get the stack working properly, so it works only for my
benchmarked method), and got:

$ shade.0903.jni.5.encoder/bin/java -Xmx128m -Xms128m nalog.nalog
iteration: 0 millis:3351
iteration: 1 millis:3362
iteration: 2 millis:3392

I guess the main benefit comes from proper caching of the VM_thread
and hythread pointers.

1.2. Would it be cleaner to implement some VMMagic that performs the
transition? I don't know exactly what it would cost in terms of
development time, but the benefits are obvious: we have either the
cross-platform implementation or the optimized one.

2. How much does allocation/deallocation of local handles cost? I knew
that my method does not allocate anything special, so I simply took
free_local_handles() out of the transition and got:

$ shade.0903.jni.6.encoder_nofree/bin/java -Xmx128m -Xms128m nalog.nalog
iteration: 0 millis:2841
iteration: 1 millis:2861
iteration: 2 millis:2894

It's so close to RI times :)

Thus, even a simple check of the allocated handle count could improve
the transition. Note that freeing local handles actually means
traversing the linked list, so the more handles we have, the more
noticeable the housekeeping cost will be.
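
A minimal sketch of that check, assuming the handles live in a singly
linked list with a counter next to it. The types and function names are
hypothetical and do not mirror the real m2n frame or
free_local_handles(); nodes_visited is instrumentation only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical local handle: one list node per referenced object. */
typedef struct sketch_handle {
    void *object;
    struct sketch_handle *next;
} sketch_handle_t;

/* Hypothetical per-transition frame that owns the handle list. */
typedef struct {
    sketch_handle_t *head;
    size_t count;           /* number of handles currently allocated */
    size_t nodes_visited;   /* instrumentation: nodes walked on free */
} sketch_frame_t;

/* Allocate a handle and link it in; returns NULL if malloc fails. */
sketch_handle_t *sketch_alloc_handle(sketch_frame_t *f, void *obj) {
    sketch_handle_t *h = malloc(sizeof *h);
    if (h == NULL) {
        return NULL;
    }
    h->object = obj;
    h->next = f->head;
    f->head = h;
    f->count++;
    return h;
}

/* Free every handle: the cost grows with the length of the list. */
void sketch_free_handles(sketch_frame_t *f) {
    sketch_handle_t *h = f->head;
    while (h != NULL) {
        sketch_handle_t *next = h->next;
        free(h);
        f->nodes_visited++;
        h = next;
    }
    f->head = NULL;
    f->count = 0;
}

/* Transition epilogue with a fast path: skip cleanup when nothing was
 * allocated. Returns 1 if cleanup ran, 0 if it was skipped. */
int sketch_pop_frame(sketch_frame_t *f) {
    if (f->count == 0) {
        assert(f->head == NULL);
        return 0;
    }
    sketch_free_handles(f);
    return 1;
}
```

For a native method like the benchmarked one, which allocates no
handles, sketch_pop_frame() returns immediately after one comparison.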

II. JNI callbacks

Then we come to JNI callbacks. Typical code for a JNI callback looks like:

jbyte JNICALL GetByteFieldOffset(JNIEnv * UNREF jni_env, jobject obj,
jint offset)
{
   assert(hythread_is_suspend_enabled());
   ObjectHandle h = (ObjectHandle)obj;

   if (exn_raised()) return 0;

   tmn_suspend_disable();       //---------------------------------v
   Byte *java_ref = (Byte *)h->object;
   jbyte val = *(jbyte *)(java_ref + offset);
   tmn_suspend_enable();        //---------------------------------^

   return val;
}

Here we see the real work, such as computing the offset, and a lot of
supplementary work. The most interesting functions are exn_raised(),
tmn_suspend_enable() and tmn_suspend_disable(). Let's dig into them:

1. tmn_suspend_enable/disable() are aliases for
hythread_suspend_enable/disable():

hy_inline void VMCALL hythread_suspend_enable() {
   register hythread_t thread;
   assert(!hythread_is_suspend_enabled());
   thread = tm_self_tls;
   ((HyThread_public *)thread)->disable_count--;
}

tm_self_tls is an alias for hythread_self(). Note that we are taking
TLS two times, once per call in the enable/disable pair.

2. exn_raised():

bool exn_raised()
{
   // no need to disable gc for simple null equality check
   return ((NULL != p_TLS_vmthread->thread_exception.exc_object)
       || (NULL != p_TLS_vmthread->thread_exception.exc_class));
}

p_TLS_vmthread is an alias for get_vm_thread_fast_self(), which in
turn calls hythread_self() again, so exn_raised() takes TLS 2 more
times. I've filed https://issues.apache.org/jira/browse/HARMONY-4811
regarding this.

So, in many JNI callbacks we are taking TLS four times!

During the investigation I spotted an abandoned issue,
https://issues.apache.org/jira/browse/HARMONY-3172, that addresses
this problem. The idea there is quite elegant: if we need TLS so
frequently, let's store it in the JNIEnv and make fast calls for
JNI methods! That would surely make JNI calls lighter.
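
To make the idea concrete, here is a sketch of a byte-read callback in
both shapes: one that takes TLS four times, as counted above, and one
that reads a thread pointer cached in the environment. Everything here
is a hypothetical model (sketch_env_t is not the real JNIEnv layout, and
sketch_tls_self() only stands in for hythread_self()); the counter is
there just to show the number of lookups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread VM data (not the real VM_thread). */
typedef struct {
    void *exc_object;       /* pending exception object, if any */
    void *exc_class;        /* pending lazy exception class, if any */
    int disable_count;      /* > 0 means suspension is disabled */
} sketch_vm_thread_t;

_Thread_local sketch_vm_thread_t sketch_current;
_Thread_local int sketch_lookups;       /* instrumentation only */

/* Stand-in for the TLS lookup; each call counts as one lookup. */
sketch_vm_thread_t *sketch_tls_self(void) {
    sketch_lookups++;
    return &sketch_current;
}

/* Hypothetical environment that caches the owning thread's pointer. */
typedef struct {
    sketch_vm_thread_t *thread;
} sketch_env_t;

/* Done once, by the thread that will use this environment. */
void sketch_env_attach(sketch_env_t *env) {
    env->thread = sketch_tls_self();
}

/* Current shape: two lookups for the exception check, then one each
 * for suspend disable and suspend enable. */
signed char sketch_get_byte_slow(const signed char *base, int offset) {
    if (sketch_tls_self()->exc_object != NULL
        || sketch_tls_self()->exc_class != NULL) {
        return 0;
    }
    sketch_tls_self()->disable_count++;
    signed char val = base[offset];
    sketch_tls_self()->disable_count--;
    return val;
}

/* TLS-aware shape: every access goes through the cached pointer. */
signed char sketch_get_byte_fast(sketch_env_t *env,
                                 const signed char *base, int offset) {
    sketch_vm_thread_t *t = env->thread;
    if (t->exc_object != NULL || t->exc_class != NULL) {
        return 0;
    }
    t->disable_count++;
    signed char val = base[offset];
    t->disable_count--;
    return val;
}
```

The fast variant is only correct if the environment is never used by a
thread other than the one that attached it. The JNI specification says
a JNIEnv pointer is valid only in its own thread, but whether that is
enough in DRLVM is exactly what needs to be decided.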

So, the open issues are:

1. Is it safe to store the TLS in JNIEnv?
2. Is it good and safe to implement TLS-aware methods?

Thanks,
Aleksey Shipilev
