I have pushed a change to JRuby master that fails over to a simple inline cache after one failed GWT. This is not how I would want to do it long-term, but it does bring some benchmarks back in line with non-indy JRuby. Specifically, bench_richards 1000000 now runs in a normal amount of time...but it's not faster than non-indy just yet (it was *much* slower before). The simple bimorphic benchmark I posted earlier also performs well...and runs slightly faster than non-indy.
In case it wasn't obvious: yay! I have not yet implemented a bimorphic (or greater) PIC chain yet, but that will come next. Regarding why this is not sufficient: Since it's a hard failure, any class modification that cascades into the cached method's class will kill it permanently. I need some wiggle room in place to avoid the chaotic early boot of an application from nuking a bunch of call sites. For example...if someone monkey-patches String#gsub, any call sites that have cached the non-patched gsub will permanently fail to IC. Ideally, they'd only fail permanently if a call site remained forever polymorphic...and even better would be to go back to GWT if a call site appears to be polymorphic for a while but is actually monomorphic in steady state (I don't think Hotspot does this...deopt is only triggered when guards fail, and there's no speculative re-profiling after reaching steady state). I'd be interested in discussing fail strategies with others on this list. Specifically, I'm debating a number of aspects of JRuby's invokedynamic use: * Number of GWT in a "PIC chain" * Number of failed GWT to give up on the PIC chain * Combination of PIC chain and IC * MH usage in the IC (currently the IC "fail" logic in JRuby's indy stuff is the same in general as CachingCallSite logic) * Strategies for preventing the chaotic boot time from damaging call sites unnecessarily Basically the challenge is balancing inline caches, GWT-based indy dispatch, GWT-based PIC, and class structure volatility to get the best possible performance once a stable state is reached (and not having late-run class structure changes from doing permanent damage). I'd really love to hear from John or anyone else familiar with JVM strategies for deopt. I know Hotspot will inline monomorphic and bimorphic sites, and anything higher ends up turning into a CALL (vtable? IC?) This is fun :) - Charlie On Fri, Apr 29, 2011 at 3:12 PM, Charles Oliver Nutter <head...@headius.com> wrote: > I think my strategy is going to be something like this: > > * Bind first method encountered straight through with a new GWT > * Upon encountering nth new method, rebuild GWT chain to check first, > second, nth PIC-style > * For some N encountered methods, failover to a simple IC > > Some variation of these (profiling GWT misses versus hits, direct call > + IC chain, etc) should work around the GWT creation overhead. > > I suppose you could pre-create GWTs, but that seems more cumbersome > since in many/most cases the test will be time-sensitive (e.g. in > JRuby where it's based on the current "generation" of a class, which > will change over time). > > For monomorphic, GWT + direct handle will certainly be faster than > CachingCallSite. For some N-morphic, a GWT "PIC chain" will still be > faster than CCS. And then the failover should be no slower than CCS. > Not a bad trade-off. > > - Charlie > > On Fri, Apr 29, 2011 at 2:59 PM, Ola Bini <ola.b...@gmail.com> wrote: >> Hi, >> >> Given that creating GWTs are expensive, is it a really bad idea to >> create them and bind them on a cache miss then? My current logic for >> call sites look something like this: >> >> invoke call site >> if fallback, check if current morphism is < 10. >> If so, create a new GWT with the currently found method and >> appropriate test. >> >> How would you recommend doing this without creating GWTs at runtime? >> Having ten slots in the call site and precreate the GWTs that use them? >> >> Cheers >> >> On 2011-04-29 09.59, Rémi Forax wrote: >>> On 04/28/2011 09:58 PM, Charles Oliver Nutter wrote: >>>> I'm trying to figure out why polymorphic dispatch is incredibly slow >>>> in JRuby + indy. Take this benchmark, for example: >>>> >>>> class A; def foo; end; end >>>> class B; def foo; end; end >>>> >>>> a = A.new >>>> b = B.new >>>> >>>> 5.times { puts Benchmark.measure { 1000000.times { a, b = b, a; a.foo; >>>> b.foo } } } >>>> >>>> a.foo and b.foo are bimorphic here. Under stock JRuby, using >>>> CachingCallSite, this benchmark runs in about 0.13s per iteration. >>>> Using invokedynamic, it takes 9s!!! >>>> >>>> This is after a patch I just committed that caches the target method >>>> handle for direct paths. I believe the only thing created when GWT >>>> fails now is a new GWT. >>> >>> If you want to emulate a bimorphic cache, you should have two GWTs. >>> So no construction of new GWT after discovering all possible targets >>> for the two callsites. >>> >>> Relying on a mutable MethodHandle, a method handle that change >>> for every call will not work well because the JIT will not be able to >>> inline through this mutable method handle. >>> >>>> Is it expected that rebinding a call site or constructing a GWT would >>>> be very expensive? If yes...I will have to look into having a hard >>>> failover to inline caching or a PIC-like handle chain for polymorphic >>>> cases. That's not necessarily difficult. If no...I'm happy to update >>>> my build and play with patches to see what's happening here. >>> >>> Yes, it's expensive. >>> The target of a CallSite should be stable. >>> So yes it's expensible and yes it's intended. >>> >>>> A sampled profile produced the following output: >>>> >>>> Stub + native Method >>>> 57.6% 0 + 5214 java.lang.invoke.MethodHandleNatives.init >>>> 30.9% 0 + 2798 java.lang.invoke.MethodHandleNatives.init >>>> 2.1% 0 + 189 java.lang.invoke.MethodHandleNatives.getTarget >>>> 0.1% 0 + 7 java.lang.Object.getClass >>>> 0.0% 0 + 3 java.lang.Class.isPrimitive >>>> 0.0% 0 + 3 java.lang.System.arraycopy >>>> 90.7% 0 + 8214 Total stub >>>> >>>> Of course we all know how accurate sampled profiles are, but this is >>>> pretty a pretty dismal result. >>>> >>>> I suspect that this polymorphic cost is a *major* factor in slowing >>>> down some benchmarks under invokedynamic. FWIW, the above benchmark >>>> without the a,b swap runs in 0.06s, better than 2x faster than stock >>>> JRuby (yay!). >>>> >>>> - Charlie >>> >>> Rémi >>> >>> _______________________________________________ >>> mlvm-dev mailing list >>> mlvm-dev@openjdk.java.net >>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev >>> >> >> >> -- >> Ola Bini (http://olabini.com) >> Ioke - JRuby - ThoughtWorks >> >> "Yields falsehood when quined" yields falsehood when quined. >> >> _______________________________________________ >> mlvm-dev mailing list >> mlvm-dev@openjdk.java.net >> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev >> > _______________________________________________ mlvm-dev mailing list mlvm-dev@openjdk.java.net http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev