On Aug 8, 2011, at 11:52 AM, Vladimir Kozlov wrote:

> Christian Thalinger wrote:
>> On Aug 8, 2011, at 4:55 PM, Vladimir Kozlov wrote:
>>
>>> Christian,
>>>
>>> Should we put the "skip bytecode quickening" code under a flag so that
>>> this only happens when invokedynamic is enabled?  Or is put_code zero
>>> only in the invokedynamic case?
>>
>> No, it doesn't buy us anything.  The new checking code is only executed
>> the first time, as the bytecodes are quickened right after that.  And in
>> the case where a putfield isn't quickened and we call resolve_get_put it
>> gets very expensive anyway.
>
> You lost me here.  The new code in resolve_get_put() is executed only for
> putfield to CallSite.target.  But the new code in patch_bytecode() skips
> quickening for all putfield bytecodes.  My question is: can you narrow the
> skipped quickening to only putfield to CallSite.target?  Or are you saying
> that there is no performance difference between executing _aputfield vs.
> _fast_aputfield?
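The quickening decision under discussion can be sketched in plain Java. This is an illustrative sketch only: HotSpot's `TemplateTable::patch_bytecode` is generated assembler, and all names and constants below are invented for this example; only the "put_code of zero means don't quicken" rule comes from the thread.

```java
// Hypothetical sketch of the patch_bytecode control flow described above.
class PatchSketch {
    // A put_code of zero means "unresolved": leave the slow putfield in
    // place so every execution re-enters resolve_get_put, which can then
    // deoptimize compiled methods that depend on CallSite.target.
    static int patchBytecode(int putCode, int fastPutCode) {
        if (putCode == 0) {
            return 0;           // skip quickening (CallSite.target case)
        }
        return fastPutCode;     // normal case: rewrite to the fast bytecode
    }
}
```

This matches Tom's point below: only entries whose put_code is zero stay unquickened; every other putfield is rewritten to its fast form exactly as before.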
It only skips quickening if put_code is zero, which is only done for
CallSite.target.  All the others proceed as they used to.

tom

> Vladimir
>
>>
>>> On 8/8/11 6:56 AM, Christian Thalinger wrote:
>>>>> Why on sparc do you use ld_ptr() to load from the cache, but on X86 and
>>>>> X64 you use movl() (only 32 bits)?
>>>>
>>>> Good question.  I took the code from
>>>> TemplateTable::resolve_cache_and_index without thinking about it, and
>>>> that one uses ld_ptr.
>>>>
>>>> _indices in ConstantPoolCacheEntry is defined as intx:
>>>>
>>>>   volatile intx _indices;  // constant pool index & rewrite bytecodes
>>>>
>>>> and bytecodes 1 and 2 are in the upper 16 bits of the lower 32-bit word:
>>>>
>>>>   // bit number |31                0|
>>>>   // bit length |-8--|-8--|---16----|
>>>>   // --------------------------------
>>>>   // _indices   [ b2 | b1 |  index  ]
>>>>
>>>> Loading 32 bits on LE gives you the right bits, but on BE it does not.
>>>> I think that's the reason for the "optimization" on x64.
>>>
>>> I don't like this "optimization" but I understand why we're using it.
>>> Add a comment (especially in the x64 file).
>>
>> I factored reading the bytecode into
>> InterpreterMacroAssembler::get_cache_and_index_and_bytecode_at_bcp, since
>> the same code is used twice in TemplateTable, and added the comment there.
>>
>>>>> I am concerned about the use of a short branch in the new code in
>>>>> templateTable_sparc.cpp:
>>>>>
>>>>>   cmp_and_br_short(..., L_patch_done);  // don't patch
>>>>>
>>>>> There is a __ stop() call which generates a lot of code, so the label
>>>>> L_patch_done could be far away.
>>>>
>>>> Yeah, I thought I'd give it a try and see if it works.  cmp_and_br_short
>>>> should assert if the branch displacement is too far, right?
>>>>
>>> Yes, it will assert, but maybe only in some worst case which we do not
>>> test.  For example, try to run a 64-bit fastdebug VM on Sparc +
>>> compressed oops + VerifyOops.
>>
>> That works.
>>
>>>>>
>>>>> Why did you add new #includes into ciEnv.cpp and nmethod.cpp?  What
>>>>> code needs them?
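The _indices packing and the endianness pitfall Christian describes can be illustrated with a small stand-alone Java program. The accessor names here are made up for this sketch; only the bit layout (b2 | b1 | index in the low 32-bit word) comes from the thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of the ConstantPoolCacheEntry::_indices layout:
//   bit layout (low 32-bit word): [ b2 (8) | b1 (8) | index (16) ]
class IndicesLayout {
    static int pack(int b2, int b1, int index) {
        return ((b2 & 0xFF) << 24) | ((b1 & 0xFF) << 16) | (index & 0xFFFF);
    }
    static int bytecode1(int indices)          { return (indices >>> 16) & 0xFF; }
    static int bytecode2(int indices)          { return (indices >>> 24) & 0xFF; }
    static int constantPoolIndex(int indices)  { return indices & 0xFFFF; }

    public static void main(String[] args) {
        int indices = pack(0xB5, 0xB4, 42);   // 0xB4/0xB5 = getfield/putfield
        System.out.println(bytecode1(indices));         // 180
        System.out.println(bytecode2(indices));         // 181
        System.out.println(constantPoolIndex(indices)); // 42

        // On a 64-bit VM, _indices is an intx (64 bits).  A 32-bit load from
        // the field's address returns the low 32 bits on little-endian, but
        // the HIGH 32 bits on big-endian -- which is why SPARC must use a
        // full-width ld_ptr while x86/x64 can get away with movl.
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.order(ByteOrder.LITTLE_ENDIAN).putLong(0, 0xFFFFFFFFL & indices);
        System.out.println(buf.getInt(0) == indices);   // true: low word first
        buf.order(ByteOrder.BIG_ENDIAN).putLong(0, 0xFFFFFFFFL & indices);
        System.out.println(buf.getInt(0));              // 0: low word is at offset 4
    }
}
```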
>>>>> Nothing else is changed in these files.
>>>>
>>>> Both files use dependencies and I got linkage errors on Linux while
>>>> working on the fix (because of inline methods).  It seems that the
>>>> include is not required in ciEnv.cpp because ciEnv.hpp already includes
>>>> it.  I missed that.  But nmethod.cpp needs it because nmethod.hpp only
>>>> declares class Dependencies.
>>>>
>>> OK.
>>>
>>>>> Why did you not keep the "volatile" call site inlining with a guard?
>>>>> You did not explain why a virtual call is fine for it.
>>>>
>>>> The spec of MutableCallSite says:
>>>>
>>>> "For target values which will be frequently updated, consider using a
>>>> volatile call site instead."
>>>>
>>>> And VolatileCallSite says:
>>>>
>>>> "A VolatileCallSite is a CallSite whose target acts like a volatile
>>>> variable.  An invokedynamic instruction linked to a VolatileCallSite
>>>> sees updates to its call site target immediately, even if the update
>>>> occurs in another thread.  There may be a performance penalty for such
>>>> tight coupling between threads.
>>>>
>>>> Unlike MutableCallSite, there is no syncAll operation on volatile call
>>>> sites, since every write to a volatile variable is implicitly
>>>> synchronized with reader threads.
>>>>
>>>> In other respects, a VolatileCallSite is interchangeable with
>>>> MutableCallSite."
>>>>
>>>> Since VolatileCallSite really should only be used when you know the
>>>> target changes very often, we don't do any optimizations for this case.
>>>> Obviously this is just a guess at how people will use VolatileCallSite,
>>>> but I think for now it is a safe bet.
>>>>
>>> Thank you for explaining it.
>>>
>>>> Additionally I had to make two small changes because the build was
>>>> broken on some configurations:
>>>>
>>>> -  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : NULL;
>>>> +  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : (klassOop) NULL;
>>>>
>>>> and
>>>>
>>>> -  MutexLockerEx ccl(CodeCache_lock, thread);
>>>> +  MutexLockerEx ccl(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>>>>
>>>> I updated the webrev.
>>>
>>> Good.
>>
>> Thanks.
>>
>> -- Christian
>>
>>> Vladimir
>>>
>>>> -- Christian
>>>>
>>>>> Vladimir
>>>>>
>>>>> On 8/5/11 6:32 AM, Christian Thalinger wrote:
>>>>>> http://cr.openjdk.java.net/~twisti/7071653
>>>>>>
>>>>>> 7071653: JSR 292: call site change notification should be pushed not
>>>>>> pulled
>>>>>> Reviewed-by:
>>>>>>
>>>>>> Currently every speculatively inlined method handle call site has a
>>>>>> guard that compares the current target of the CallSite object to the
>>>>>> inlined one.  This per-invocation overhead can be removed if the
>>>>>> notification is changed from pulled to pushed (i.e. deoptimization).
>>>>>>
>>>>>> I had to change the logic in TemplateTable::patch_bytecode to skip
>>>>>> bytecode quickening for putfield instructions when the put_code
>>>>>> written to the constant pool cache is zero.  This is required so that
>>>>>> every execution of a putfield to CallSite.target calls out to
>>>>>> InterpreterRuntime::resolve_get_put to do the deoptimization of
>>>>>> dependent compiled methods.
>>>>>>
>>>>>> I also had to change the dependency machinery to understand
>>>>>> dependencies other than class-hierarchy ones.  DepChange became the
>>>>>> super-type of two new dependency kinds, KlassDepChange and
>>>>>> CallSiteDepChange.
>>>>>>
>>>>>> Tested with JRuby tests and benchmarks, hand-written test cases, JDK
>>>>>> tests and vm.mlvm tests.
>>>>>>
>>>>>> Here is the speedup for the JRuby fib benchmark (first is JDK 7 b147,
>>>>>> second with 7071653).
>>>>>> Since the CallSite targets don't change during
>>>>>> the runtime of this benchmark we can see the performance benefit of
>>>>>> eliminating the guard:
>>>>>>
>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>   0.883000   0.000000   0.883000 (  0.854000)
>>>>>>   0.715000   0.000000   0.715000 (  0.715000)
>>>>>>   0.712000   0.000000   0.712000 (  0.712000)
>>>>>>   0.713000   0.000000   0.713000 (  0.713000)
>>>>>>   0.713000   0.000000   0.713000 (  0.712000)
>>>>>>
>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>   0.772000   0.000000   0.772000 (  0.742000)
>>>>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>>>>   0.621000   0.000000   0.621000 (  0.621000)
>>>>>>   0.622000   0.000000   0.622000 (  0.622000)
>>>>>>   0.622000   0.000000   0.622000 (  0.621000)
>>>>>>
>>

_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
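For readers following the MutableCallSite/VolatileCallSite distinction quoted above, the public API can be exercised with a small stand-alone program. This demo is not part of the patch; 7071653 only changes how the JIT reacts to setTarget (pushed deoptimization vs. a per-call guard), which is invisible at this API level.

```java
import java.lang.invoke.*;

public class CallSiteDemo {
    public static void main(String[] args) throws Throwable {
        MethodType type = MethodType.methodType(int.class, int.class);
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle inc = lookup.findStatic(CallSiteDemo.class, "inc", type);
        MethodHandle dec = lookup.findStatic(CallSiteDemo.class, "dec", type);

        // MutableCallSite: compiled code may inline the current target and
        // register a dependency; setTarget() then deoptimizes dependents
        // instead of every call re-checking the target.
        MutableCallSite site = new MutableCallSite(inc);
        MethodHandle invoker = site.dynamicInvoker();
        System.out.println((int) invoker.invokeExact(41)); // 42
        site.setTarget(dec);
        MutableCallSite.syncAll(new MutableCallSite[] { site });
        System.out.println((int) invoker.invokeExact(41)); // 40

        // VolatileCallSite: the target acts like a volatile field, so the
        // JIT keeps an indirect call -- no inlining guard, no syncAll.
        VolatileCallSite vsite = new VolatileCallSite(inc);
        MethodHandle vinvoker = vsite.dynamicInvoker();
        System.out.println((int) vinvoker.invokeExact(41)); // 42
        vsite.setTarget(dec);
        System.out.println((int) vinvoker.invokeExact(41)); // 40
    }

    static int inc(int x) { return x + 1; }
    static int dec(int x) { return x - 1; }
}
```

The benchmark above exploits exactly the MutableCallSite path: since the target never changes after warmup, the inlined code runs with no guard at all.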