Hi all,

While the M3 code is stabilizing, I would like to propose a new direction of optimizations in DRLVM for M4: JNI improvements. I have done a couple of investigations and would like to share the results with you:

I. JNI transition

For now, the JNI transition is generated as a LIL stub, and it is clearly inefficient: look at the generated code and you will see something like 70 instructions plus several calls accompanied by push/pop pairs. I measured JNI transition performance with a simple test that calls a trivial native method (no parameters, returns one integer) several million times. Baseline measurements for Harmony/DRLVM and Sun 1.6.0:

    $ shade.0903.jni.0.clean/bin/java -Xmx128m -Xms128m nalog.nalog
    iteration: 0 millis:8532
    iteration: 1 millis:8572
    iteration: 2 millis:8561

    $ jdk1.6.0/bin/java -Xmx128m -Xms128m nalog.nalog
    iteration: 0 millis:2560
    iteration: 1 millis:2512
    iteration: 2 millis:2542

The measurements were done on Windows XP/ia32, but are reproducible on Linux/ia32 too. Note that Harmony is about 3.4x slower than Sun 1.6.0.

Here is the list of presumably easy fixes that could help:

1. http://issues.apache.org/jira/browse/HARMONY-4705

First of all, the pointer to "thread self", which is required to obtain some needed parameters, is taken through calls to get_vm_thread_ptr(), which in turn makes several calls before finally reaching hythread_self(). This is due to the implementation of the LIL code generator, which expands the "ts" mnemonic into this chain of calls. The idea is simple and most likely safe: use the thread helper to obtain TLS.

2. http://issues.apache.org/jira/browse/HARMONY-4714

Moving further, exception checking is done twice: the first check is generated in the LIL stub, the second is done in pop_m2n. We could introduce a new LIL mnemonic, pop_noexp_m2n, to solve that problem. Meanwhile we could inline m2n_(push|pop)_local_handles and get some benefit from that. The drawback is obvious: we need to take care of the other platforms and implement the corresponding LIL code generator parts for them.

3. http://issues.apache.org/jira/browse/HARMONY-4729

Then we come to the hythread_suspend_enable()/disable() pair.
Each of these operations takes TLS by itself again. We could again implement two new LIL mnemonics, "hse" and "hsd", generate code equivalent to these functions in the LIL code generator (effectively inlining them), and make use of HARMONY-4705 to quickly obtain TLS. Again, there is only support for IA32.

Taking all this together:

    $ shade.0903.jni.4.cumulative/bin/java -Xmx128m -Xms128m nalog.nalog
    iteration: 0 millis:5477
    iteration: 1 millis:5528
    iteration: 2 millis:5528

...we have up to a +55% boost on the microtest and, most interestingly, a +14% boost on Dacapo:jython, +10% on Dacapo:pmd and so on. But still, Harmony is about 2.2x slower than Sun 1.6.0. Of course, even these "fixes" are not enough, and the generated code still looks ugly (but better :)). These measurements show that thorough optimization of the JNI transition could help us on many workloads: think of client workloads with Swing, server workloads with networking and so on.

There are a couple of questions and open issues:

1. Do we really need LIL to generate the JNI transition?

1.1. Would it be better to have encoder-based versions of the JNI transition for each platform rather than supporting new mnemonics in the LIL code generator? To answer that question, I have rewritten the JNI transition on the encoder: https://issues.apache.org/jira/browse/HARMONY-4806 (I haven't managed to get the stack working properly, so it works only for my benchmarked method) and got:

    $ shade.0903.jni.5.encoder/bin/java -Xmx128m -Xms128m nalog.nalog
    iteration: 0 millis:3351
    iteration: 1 millis:3362
    iteration: 2 millis:3392

The main benefit is proper caching of the VM_thread and hythread pointers, I guess.

1.2. Would it be cleaner to implement some VMMagic that would make the transition? I don't know exactly what it would cost in terms of development time, but the benefits are obvious - we have either the cross-platform implementation or the optimized one.

2. How much does allocation/deallocation of local handles cost?
I knew that my method does not allocate anything special, so I simply took the free_local_handles() call out of the transition and got:

    $ shade.0903.jni.6.encoder_nofree/bin/java -Xmx128m -Xms128m nalog.nalog
    iteration: 0 millis:2841
    iteration: 1 millis:2861
    iteration: 2 millis:2894

It's so close to the RI times :) Thus, even a simple check of the allocated handle count could improve the transition. Note that freeing local handles actually means traversing the linked list, so the more handles we have, the more noticeable the housekeeping would be.

II. JNI callbacks

Then we come to JNI callbacks. Typical code for a JNI callback looks like this:

    jbyte JNICALL GetByteFieldOffset(JNIEnv * UNREF jni_env, jobject obj, jint offset)
    {
        assert(hythread_is_suspend_enabled());
        ObjectHandle h = (ObjectHandle)obj;
        if (exn_raised()) return 0;
        tmn_suspend_disable(); //---------------------------------v
        Byte *java_ref = (Byte *)h->object;
        jbyte val = *(jbyte *)(java_ref + offset);
        tmn_suspend_enable();  //---------------------------------^
        return val;
    }

Here we see the real work, like computing the offset, and a lot of supplementary work. The most interesting methods are exn_raised(), tmn_suspend_enable() and tmn_suspend_disable(). Let's dig into them:

1. tmn_suspend_enable/disable() are aliases for hythread_suspend_enable/disable():

    hy_inline void VMCALL hythread_suspend_enable() {
        register hythread_t thread;
        assert(!hythread_is_suspend_enabled());
        thread = tm_self_tls;
        ((HyThread_public *)thread)->disable_count--;
    }

tm_self_tls is an alias for hythread_self(). Note that we are taking TLS two times.

2. exn_raised():

    bool exn_raised() {
        // no need to disable gc for simple null equality check
        return ((NULL != p_TLS_vmthread->thread_exception.exc_object)
             || (NULL != p_TLS_vmthread->thread_exception.exc_class));
    }

p_TLS_vmthread is an alias for get_vm_thread_fast_self(), which in turn calls hythread_self() again - 2 times.

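To make the lookup counting above concrete, here is a minimal, self-contained sketch. It is not DRLVM code: the Thread struct, thread_self(), the tls_lookups counter and the get_byte_slow()/get_byte_fast() functions are all hypothetical stand-ins assumed only for illustration. It counts how many times a callback shaped like the one above reaches the TLS accessor, and compares that with a variant that fetches the thread pointer once and reuses it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread structure (not the DRLVM type). */
typedef struct {
    int   disable_count;
    void *exc_object;
    void *exc_class;
} Thread;

static Thread the_thread;      /* pretend this lives in TLS */
static int    tls_lookups = 0; /* counts calls to the TLS accessor */

/* Hypothetical TLS accessor; each call models one hythread_self(). */
static Thread *thread_self(void) {
    tls_lookups++;
    return &the_thread;
}

/* Pattern from the callback above: every helper fetches TLS on its own. */
static int exn_raised_slow(void) {
    return thread_self()->exc_object != NULL
        || thread_self()->exc_class != NULL;
}

static void suspend_disable_slow(void) {
    thread_self()->disable_count++;
}

static void suspend_enable_slow(void) {
    Thread *t = thread_self();
    assert(t->disable_count > 0);
    t->disable_count--;
}

static signed char get_byte_slow(const signed char *obj, int offset) {
    if (exn_raised_slow()) return 0;
    suspend_disable_slow();
    signed char val = obj[offset];
    suspend_enable_slow();
    return val;
}

/* Alternative: fetch the thread pointer once and reuse it everywhere. */
static signed char get_byte_fast(const signed char *obj, int offset) {
    Thread *t = thread_self();
    if (t->exc_object != NULL || t->exc_class != NULL) return 0;
    t->disable_count++;
    signed char val = obj[offset];
    t->disable_count--;
    return val;
}
```

With no pending exception, one call to get_byte_slow() increments tls_lookups four times (two in the exception check, one each for disable and enable), while get_byte_fast() increments it once. How much this matters in practice depends on how expensive the real TLS accessor is on a given platform.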
I've filed https://issues.apache.org/jira/browse/HARMONY-4811 about this. So, in many of the JNI callbacks we are taking TLS four times!

During the investigation I spotted an abandoned issue, https://issues.apache.org/jira/browse/HARMONY-3172, that addresses this problem. The idea there is quite elegant: if we need TLS so frequently, let's store it in the JNIEnv and make fast calls for JNI methods! That would surely make the JNI calls lighter.

So, the open issues are:

1. Is it safe to store the TLS in JNIEnv?
2. Is it good and safe to implement TLS-aware methods?

Thanks,
Aleksey Shipilev
