Tom Rodriguez wrote:
> On Aug 8, 2011, at 11:52 AM, Vladimir Kozlov wrote:
>
>> Christian Thalinger wrote:
>>> On Aug 8, 2011, at 4:55 PM, Vladimir Kozlov wrote:
>>>
>>>> Christian,
>>>>
>>>> Should we put the "skip bytecode quickening" code under a flag so it is
>>>> only active when invokedynamic is enabled? Or is put_code zero only in
>>>> the invokedynamic case?
>>>
>>> No, it doesn't buy us anything. The new checking code is only executed the
>>> first time, as the bytecodes are quickened right after that. And in the
>>> case where a putfield isn't quickened and we call resolve_get_put it gets
>>> very expensive anyway.
>>
>> You lost me here. The new code in resolve_get_put() is executed only for
>> putfields to CallSite.target, but the new code in patch_bytecode() skips
>> quickening for all putfield bytecodes. My question is: can you narrow the
>> skipped quickening to only putfields to CallSite.target? Or are you saying
>> that there is no performance difference between executing _aputfield vs
>> _fast_aputfield?
>
> It only skips quickening if put_code is zero, which is only done for
> CallSite.target. All the others proceed as they used to.
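[To make the put_code discussion concrete, here is a minimal, hypothetical Java sketch, not HotSpot source: the class, method names, and the FAST_APUTFIELD value are invented for illustration. It models the decision described above: a put_code of zero is written only for putfields to CallSite.target, and such entries are never rewritten to the fast bytecode, so every execution re-enters resolve_get_put.]

```java
// Hypothetical sketch of the quickening decision discussed above.
// NOT HotSpot source: names and the fast-bytecode value are invented.
public class QuickeningSketch {
    static final int PUTFIELD = 0xB5;        // the slow, generic JVM bytecode (181)
    static final int FAST_APUTFIELD = 0xD0;  // a quickened form (value made up here)

    // Mimics the guard added to TemplateTable::patch_bytecode: a put_code of
    // zero (written only for putfields to CallSite.target) suppresses patching.
    static int patchBytecode(int putCode) {
        if (putCode == 0) {
            return PUTFIELD;       // stay slow: every execution re-resolves
        }
        return FAST_APUTFIELD;     // ordinary putfields are quickened as before
    }

    public static void main(String[] args) {
        System.out.println(patchBytecode(0) == PUTFIELD);              // true
        System.out.println(patchBytecode(0xD0) == FAST_APUTFIELD);     // true
    }
}
```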
Good. Thank you, Tom

Vladimir

> tom
>
>> Vladimir
>>
>>>> On 8/8/11 6:56 AM, Christian Thalinger wrote:
>>>>>> Why on SPARC do you use ld_ptr() to load from the cache, but on x86 and
>>>>>> x64 you use movl() (only 32 bits)?
>>>>>
>>>>> Good question. I took the code from
>>>>> TemplateTable::resolve_cache_and_index without thinking about it, and
>>>>> that one uses ld_ptr.
>>>>>
>>>>> _indices in ConstantPoolCacheEntry is defined as intx:
>>>>>
>>>>>   volatile intx _indices;  // constant pool index & rewrite bytecodes
>>>>>
>>>>> and bytecodes 1 and 2 are in the upper 16 bits of the lower 32-bit word:
>>>>>
>>>>>   // bit number |31                0|
>>>>>   // bit length |-8--|-8--|---16----|
>>>>>   // --------------------------------
>>>>>   // _indices   [ b2 | b1 |  index  ]
>>>>>
>>>>> Loading 32 bits on LE gives you the right bits but on BE it does not. I
>>>>> think that's the reason for the "optimization" on x64.
>>>>
>>>> I don't like this "optimization" but I understand why we are using it.
>>>> Add a comment (especially in the x64 file).
>>>
>>> I factored reading the bytecode into
>>> InterpreterMacroAssembler::get_cache_and_index_and_bytecode_at_bcp since
>>> the same code is used twice in TemplateTable, and added the comment there.
>>>
>>>>>> I am concerned about the use of a short branch in the new code in
>>>>>> templateTable_sparc.cpp:
>>>>>>
>>>>>>   cmp_and_br_short(..., L_patch_done);  // don't patch
>>>>>>
>>>>>> There is a __ stop() call which generates a lot of code, so the label
>>>>>> L_patch_done could be far away.
>>>>>
>>>>> Yeah, I thought I'd give it a try and see if it works. cmp_and_br_short
>>>>> should assert if the branch displacement is too far, right?
>>>>
>>>> Yes, it will assert, but maybe only in some worst case which we do not
>>>> test. For example, try running a 64-bit fastdebug VM on SPARC +
>>>> compressed oops + VerifyOops.
>>>
>>> That works.
>>>
>>>>>> Why did you add new #includes into ciEnv.cpp and nmethod.cpp? What code
>>>>>> needs them? Nothing else is changed in these files.
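[The _indices layout quoted above can be illustrated with a small, self-contained Java sketch; this is not HotSpot code, and the packing/extraction helpers are invented for illustration. It also shows why the plain 32-bit load is LE-only: on a big-endian 64-bit machine, a 32-bit load at the address of the pointer-sized _indices field would fetch the upper half of the word, not these bits.]

```java
// Illustrative sketch (not HotSpot code) of the _indices bit layout:
// bit length |-8--|-8--|---16----|
// _indices   [ b2 | b1 |  index  ]   (low 32 bits of an intx field)
public class IndicesLayout {
    static int pack(int b2, int b1, int index) {
        return (b2 << 24) | (b1 << 16) | (index & 0xFFFF);
    }
    static int bytecode1(int indices) { return (indices >>> 16) & 0xFF; }
    static int bytecode2(int indices) { return (indices >>> 24) & 0xFF; }

    public static void main(String[] args) {
        int indices = pack(0xB5, 0xB4, 42);       // hypothetical bytecode values
        System.out.println(bytecode1(indices));   // 180 (0xB4)
        System.out.println(bytecode2(indices));   // 181 (0xB5)
        // On a big-endian 64-bit machine, a 32-bit load of the pointer-sized
        // _indices word would read its upper half instead of these bits --
        // which is why the 32-bit movl() "optimization" only works on LE.
    }
}
```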
>>>>> Both files use dependencies, and I got linkage errors on Linux while
>>>>> working on the fix (because of inline methods). It seems that the
>>>>> include is not required in ciEnv.cpp because ciEnv.hpp already includes
>>>>> it. I missed that. But nmethod.cpp needs it because nmethod.hpp only
>>>>> declares class Dependencies.
>>>>>
>>>> OK.
>>>>
>>>>>> Why did you not leave "volatile" call site inlining with a guard? You
>>>>>> did not explain why a virtual call is fine for it.
>>>>>
>>>>> The spec of MutableCallSite says:
>>>>>
>>>>> "For target values which will be frequently updated, consider using a
>>>>> volatile call site instead."
>>>>>
>>>>> And VolatileCallSite says:
>>>>>
>>>>> "A VolatileCallSite is a CallSite whose target acts like a volatile
>>>>> variable. An invokedynamic instruction linked to a VolatileCallSite sees
>>>>> updates to its call site target immediately, even if the update occurs
>>>>> in another thread. There may be a performance penalty for such tight
>>>>> coupling between threads.
>>>>>
>>>>> Unlike MutableCallSite, there is no syncAll operation on volatile call
>>>>> sites, since every write to a volatile variable is implicitly
>>>>> synchronized with reader threads.
>>>>>
>>>>> In other respects, a VolatileCallSite is interchangeable with
>>>>> MutableCallSite."
>>>>>
>>>>> Since VolatileCallSite really should only be used when you know the
>>>>> target changes very often, we don't do optimizations for this case.
>>>>> Obviously this is just a guess at how people will use VolatileCallSite,
>>>>> but I think for now this is a safe bet.
>>>>>
>>>> Thank you for explaining it.
>>>>
>>>>> Additionally, I had to make two small changes because the build was
>>>>> broken on some configurations:
>>>>>
>>>>> -  klassOop new_type = _changes.is_klass_change() ? _changes.as_klass_change()->new_type() : NULL;
>>>>> +  klassOop new_type = _changes.is_klass_change() ?
>>>>>    _changes.as_klass_change()->new_type() : (klassOop) NULL;
>>>>>
>>>>> and
>>>>>
>>>>> -  MutexLockerEx ccl(CodeCache_lock, thread);
>>>>> +  MutexLockerEx ccl(CodeCache_lock, Mutex::_no_safepoint_check_flag);
>>>>>
>>>>> I updated the webrev.
>>>>
>>>> Good.
>>>
>>> Thanks.
>>>
>>> -- Christian
>>>
>>>> Vladimir
>>>>
>>>>> -- Christian
>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> On 8/5/11 6:32 AM, Christian Thalinger wrote:
>>>>>>> http://cr.openjdk.java.net/~twisti/7071653
>>>>>>>
>>>>>>> 7071653: JSR 292: call site change notification should be pushed not
>>>>>>> pulled
>>>>>>> Reviewed-by:
>>>>>>>
>>>>>>> Currently every speculatively inlined method handle call site has a
>>>>>>> guard that compares the current target of the CallSite object to the
>>>>>>> inlined one. This per-invocation overhead can be removed if the
>>>>>>> notification is changed from pulled to pushed (i.e. deoptimization).
>>>>>>>
>>>>>>> I had to change the logic in TemplateTable::patch_bytecode to skip
>>>>>>> bytecode quickening for putfield instructions when the put_code
>>>>>>> written to the constant pool cache is zero. This is required so that
>>>>>>> every execution of a putfield to CallSite.target calls out to
>>>>>>> InterpreterRuntime::resolve_get_put to do the deoptimization of
>>>>>>> dependent compiled methods.
>>>>>>>
>>>>>>> I also had to change the dependency machinery to understand
>>>>>>> dependencies other than class-hierarchy ones. DepChange became the
>>>>>>> super-type of two new dependencies, KlassDepChange and
>>>>>>> CallSiteDepChange.
>>>>>>>
>>>>>>> Tested with JRuby tests and benchmarks, hand-written testcases, JDK
>>>>>>> tests and vm.mlvm tests.
>>>>>>>
>>>>>>> Here is the speedup for the JRuby fib benchmark (first is JDK 7 b147,
>>>>>>> second with 7071653).
>>>>>>> Since the CallSite targets don't change during
>>>>>>> the runtime of this benchmark, we can see the performance benefit of
>>>>>>> eliminating the guard:
>>>>>>>
>>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>>   0.883000   0.000000   0.883000 (  0.854000)
>>>>>>>   0.715000   0.000000   0.715000 (  0.715000)
>>>>>>>   0.712000   0.000000   0.712000 (  0.712000)
>>>>>>>   0.713000   0.000000   0.713000 (  0.713000)
>>>>>>>   0.713000   0.000000   0.713000 (  0.712000)
>>>>>>>
>>>>>>> $ jruby --server bench/bench_fib_recursive.rb 5 35
>>>>>>>   0.772000   0.000000   0.772000 (  0.742000)
>>>>>>>   0.624000   0.000000   0.624000 (  0.624000)
>>>>>>>   0.621000   0.000000   0.621000 (  0.621000)
>>>>>>>   0.622000   0.000000   0.622000 (  0.622000)
>>>>>>>   0.622000   0.000000   0.622000 (  0.621000)
>>>>>>>
>> _______________________________________________
>> mlvm-dev mailing list
>> mlvm-dev@openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
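[As a footnote to the MutableCallSite/VolatileCallSite discussion above, here is a small self-contained example of the public API in question (java.lang.invoke, JDK 7+). The setTarget call below is exactly the event that, with this patch, is pushed to dependent compiled code via deoptimization instead of being pulled through a per-invocation guard. Class and method names other than the JDK API are invented for illustration.]

```java
import java.lang.invoke.*;

// Minimal MutableCallSite usage: swap the target and observe the change
// through a dynamic invoker. syncAll publishes the new target to other
// threads; VolatileCallSite would not need it (volatile semantics).
public class CallSiteDemo {
    static int one() { return 1; }
    static int two() { return 2; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType mt = MethodType.methodType(int.class);

        MutableCallSite site =
            new MutableCallSite(lookup.findStatic(CallSiteDemo.class, "one", mt));
        MethodHandle invoker = site.dynamicInvoker();

        System.out.println((int) invoker.invokeExact());  // 1

        // The target change that triggers deoptimization of dependent nmethods:
        site.setTarget(lookup.findStatic(CallSiteDemo.class, "two", mt));
        MutableCallSite.syncAll(new MutableCallSite[] { site });

        System.out.println((int) invoker.invokeExact());  // 2
    }
}
```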