You are really advocating this :-) -- Chris
On Oct 17, 2011, at 7:11 PM, Charles Oliver Nutter wrote:

> More justification for this...
>
> Not inlining the handles would be like invokevirtual not emitting its type check inline and doing that as a separate CALL. Handles are our way to tell the JVM what native operations go with an invokedynamic. Not emitting those operations unconditionally into the compiled code treats invokedynamic as a second-class citizen. Fair to say?
>
> - Charlie (mobile)
>
> On Oct 17, 2011 11:52 AM, "Charles Oliver Nutter" <head...@headius.com> wrote:
> Yeah, that does sound like the right approach the more I think about it. The invokedynamic operation and the handles that wire it up should never be separated. Pathology aside, I know the JRuby logic would remain pretty small in almost every case. And pathological cases could be detected with some kind of MH graph-walking calculation (with a suitably large budget, since handles are individually maybe a couple of native operations each).
>
> Let me know if you want me to test something out. I *really* hope this can get into u2.
>
> - Charlie (mobile)
>
> On Oct 17, 2011 9:28 AM, "Christian Thalinger" <christian.thalin...@oracle.com> wrote:
>
> On Oct 17, 2011, at 4:14 PM, Charles Oliver Nutter wrote:
>
>> Yeah, that's a pretty big problem :) Indy almost becomes a nonstarter if these degraded cases are not made a lot better. The performance on this example becomes terrible, and it's common for Ruby methods to be larger and more complex than this, too.
>>
> Yeah, I know.
>
>> You say you don't know how to fix it, so perhaps we can brainstorm a bit? If we can't inline the call, we at least want it to perform like an uninlined method call. Could we still compile the MH chain and CALL it, so at least we're not doing c2i? Or perhaps we should always treat the MH chain as inlinable regardless of budgets, but not inline the eventual call when budgets are exceeded? The latter option sounds more correct to me; the MH chain should be considered a non-removable part of the invokedynamic operation and always inlined. That would avoid the degradation without blowing up code size.
>>
> I didn't say I don't know how to fix it, but rather that I'm not sure what the best approach is. Internally we (Tom, John, and I) already talked about this, and I tried to do something like what you suggest above: compile the MH chain and call it. It turned out it's not as easy as I thought and needs some special handling.
>
> An interesting idea is to always inline the "adaptation" code of the MH chain but not the eventual call (or all calls in the chain). I would guess that for normal usage the added code size is insignificant, but there can definitely be pathological cases. I'll try to get some data on that.
>
> -- Chris
>
>> - Charlie (mobile)
>>
>> On Oct 17, 2011 4:58 AM, "Christian Thalinger" <christian.thalin...@oracle.com> wrote:
>>
>> On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
>>
>> > I'm seeing something peculiar and wanted to run it by you folks.
>> >
>> > There are a few values that JRuby's compiler had previously been loading from instance fields every time they were needed. Specifically, fields like ThreadContext.runtime (the current JRuby runtime), Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (the false, true, and nil values). I figured I'd make a quick change today and have those instead be constant method handles bound into a mutable call site.
>> >
>> > Unfortunately, performance seems to be worse.
>> >
>> > The logic works like this:
>> >
>> > * ThreadContext is loaded to the stack
>> > * invokedynamic; the bootstrap just wires up an initialization method into a MutableCallSite
>> > * the initialization method rebinds the call site forever to a constant method handle pointing at the value (runtime/true/false/nil objects)
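[A minimal, self-contained sketch of the pattern those bullets describe: the bootstrap installs an initialization method as the initial target of a MutableCallSite, and the first call through the site rebinds it to a constant method handle (with the ThreadContext argument dropped), so every later call sees a constant. Class and method names below are invented for illustration; this is not JRuby's actual InvokeDynamicSupport code.]

import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// Hypothetical illustration of the bootstrap/rebind pattern described above.
public class ConstantSiteSketch {

    // Stand-in for org.jruby.runtime.ThreadContext, just for this sketch.
    public static class ThreadContext {
        public final Object runtime = new Object();
    }

    // Bootstrap: wire an initialization method into a MutableCallSite.
    // (A real bootstrap would also dispatch on 'name' to pick runtime/true/false/nil.)
    public static CallSite getObjectBootstrap(MethodHandles.Lookup lookup,
                                              String name,
                                              MethodType type) throws Throwable {
        MutableCallSite site = new MutableCallSite(type);
        MethodHandle init = lookup.findStatic(
                ConstantSiteSketch.class, "initRuntime",
                MethodType.methodType(Object.class, MutableCallSite.class, ThreadContext.class));
        site.setTarget(init.bindTo(site).asType(type));
        return site;
    }

    // First call: rebind the site forever to a constant handle for the value.
    public static Object initRuntime(MutableCallSite site, ThreadContext context) {
        Object runtime = context.runtime;
        MethodHandle constant = MethodHandles.constant(Object.class, runtime);
        // The site's type still takes the ThreadContext argument, so drop it.
        site.setTarget(MethodHandles.dropArguments(constant, 0, ThreadContext.class)
                                    .asType(site.type()));
        return runtime;
    }

    public static void main(String[] args) throws Throwable {
        MethodType type = MethodType.methodType(Object.class, ThreadContext.class);
        CallSite site = getObjectBootstrap(MethodHandles.lookup(), "getRuntime", type);
        ThreadContext context = new ThreadContext();
        Object first = site.dynamicInvoker().invoke(context);   // runs initRuntime and rebinds
        Object second = site.dynamicInvoker().invoke(context);  // hits the constant handle
        System.out.println(first == second);                    // true
    }
}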
>> > My expectation was that this would be at least no slower (and potentially a tiny bit faster) but also less bytecode (in the case of true/false/nil, it was previously doing ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like it's actually slower than walking those references, though, and I'm not sure why.
>> >
>> > Here are a couple of the scenarios in diff form, showing bytecode before and after:
>> >
>> > Loading "runtime"
>> >
>> > ALOAD 1
>> > - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
>> > + INVOKEDYNAMIC getRuntime (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby; [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>> >
>> > Loading "false"
>> >
>> > ALOAD 1
>> > - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
>> > - INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
>> > + INVOKEDYNAMIC getFalse (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean; [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>> >
>> > I think because these are now seen as invocations, I'm hitting some inlining budget limit I didn't hit before (and which isn't being properly discounted). The benchmark I'm seeing degrade is bench/language/bench_flip.rb, and it's a pretty significant degradation. Only the "heap" version shows the degradation, and it definitely does have more bytecode...but the bytecode with my patch differs only in the way these values are being accessed, as shown in the diffs above.
>> >
>> > Before:
>> >
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    0.951000   0.000000   0.951000 (  0.910000)
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    0.705000   0.000000   0.705000 (  0.705000)
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    0.688000   0.000000   0.688000 (  0.688000)
>> >                                      user     system      total        real
>> >
>> > After:
>> >
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    2.350000   0.000000   2.350000 (  2.284000)
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    2.128000   0.000000   2.128000 (  2.128000)
>> >                                      user     system      total        real
>> > 1m x10 while (a)..(!a) (heap)    2.115000   0.000000   2.115000 (  2.116000)
>> >                                      user     system      total        real
>> >
>> > You can see the degradation is pretty bad.
>> >
>> > I'm concerned because I had hoped that invokedynamic + mutable call site + constant handle would always be faster than a field access...since it avoids excessive field accesses and makes it possible for Hotspot to fold those constants away. What's going on here?
>>
>> I looked into this, and the main issue here is an old friend: slow invokes of non-inlined MH call sites. The problem is that you trade a normal invoke (or just a field load?) for a MH invoke. If the normal invoke doesn't get inlined we're still fine, but if the MH invoke doesn't get inlined we're screwed (since we are still doing the C2I-I2C dance).
>>
>> I refactored the benchmark a little (moved the stack and heap loops into their own methods and did only 5 while-loops instead of 11; that inlines all calls in that method) and the performance is what you had expected (a little faster):
>>
>> 32-bit:
>>
>> before:
>>
>> 1m x10 while (a)..(!a) (stack)    0.214000   0.000000   0.214000 (  0.214000)
>> 1m x10 while (a)..(!a) (heap)     0.249000   0.000000   0.249000 (  0.250000)
>>
>> after:
>>
>> 1m x10 while (a)..(!a) (stack)    0.203000   0.000000   0.203000 (  0.203000)
>> 1m x10 while (a)..(!a) (heap)     0.234000   0.000000   0.234000 (  0.234000)
>>
>> 64-bit:
>>
>> before:
>>
>> 1m x10 while (a)..(!a) (stack)    0.248000   0.000000   0.248000 (  0.248000)
>> 1m x10 while (a)..(!a) (heap)     0.257000   0.000000   0.257000 (  0.257000)
>>
>> after:
>>
>> 1m x10 while (a)..(!a) (stack)    0.226000   0.000000   0.226000 (  0.226000)
>> 1m x10 while (a)..(!a) (heap)     0.244000   0.000000   0.244000 (  0.244000)
>>
>> We have to fix that, but I'm not sure yet what the best approach is. Sorry I don't have better news for now.
>>
>> -- Chris
>>
>> >
>> > Patch for the change (apply to JRuby master) is here:
>> > https://gist.github.com/955976b52b0c4e3f611e
>> >
>> > - Charlie
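[Returning to Charlie's earlier suggestion of detecting pathological cases with "some kind of MH graph-walking calculation": the public java.lang.invoke API does not expose a handle's internal graph, so a language runtime would have to keep its own running estimate while it composes handles. The sketch below is hypothetical; the names, weights, and budget are invented for illustration and are not part of JRuby or the JDK.]

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

// Hypothetical bookkeeping for a "graph-walking with a budget" check done on
// the runtime side while a method-handle chain is being built.
public final class HandleChainBudget {
    private static final int OPS_PER_ADAPTATION = 2;  // "a couple native operations each"

    private final int budget;
    private int estimatedOps;

    public HandleChainBudget(int budget) {
        this.budget = budget;
    }

    // Record that some number of adaptations (dropArguments, guardWithTest, ...)
    // went into producing 'handle', and pass the handle through unchanged.
    public MethodHandle track(MethodHandle handle, int adaptations) {
        estimatedOps += adaptations * OPS_PER_ADAPTATION;
        return handle;
    }

    // A call-site builder could consult this before growing the chain further
    // and fall back to a plain target once the estimate blows the budget.
    public boolean exceeded() {
        return estimatedOps > budget;
    }

    public static void main(String[] args) {
        HandleChainBudget budget = new HandleChainBudget(30);
        MethodHandle h = MethodHandles.identity(Object.class);
        h = budget.track(MethodHandles.dropArguments(h, 0, Object.class), 1);
        System.out.println(budget.exceeded());  // false: one adaptation is well within budget
    }
}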
_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev