Re: MutableCallSite + constant handle slower than field accesses?

Charles Oliver Nutter Mon, 17 Oct 2011 13:31:43 -0700

It seems like the natural solution! :) Invokedynamic is nothing
without the handles wiring it up...so they should always live happily
together in the land of fairies and unicorns.


- Charlie

On Mon, Oct 17, 2011 at 5:32 PM, Christian Thalinger
<christian.thalin...@oracle.com> wrote:
> You are really advocating this :-)
> -- Chris
> On Oct 17, 2011, at 7:11 PM, Charles Oliver Nutter wrote:
>
> More justification for this...
>
> Not inlining the handles would be like invokevirtual not emitting its type
> check inline and doing that as a separate CALL. Handles are our way to tell
> the JVM what native operations go with an invokedynamic. Not emitting those
> operations unconditionally into the compiled code treats invokedynamic as a
> second class citizen. Fair to say?
>
> -  Charlie (mobile)
>
> On Oct 17, 2011 11:52 AM, "Charles Oliver Nutter" <head...@headius.com>
> wrote:
>>
>> Yeah that does sound like the right approach the more I think about it.
>> The invokedynamic operation and the handles that wire it up should never be
>> separated. Pathology aside, I know the JRuby logic would remain pretty small
>> in almost every case. And pathological cases could be detected with some
>> kind of MH graph-walking calculation (with a suitably large budget since
>> handles individually are maybe a couple native operations each).
>>
>> Let me know if you want me to test something out. I *really* hope this can
>> get into u2.
>>
>> -  Charlie (mobile)
>>
>> On Oct 17, 2011 9:28 AM, "Christian Thalinger"
>> <christian.thalin...@oracle.com> wrote:
>>>
>>> On Oct 17, 2011, at 4:14 PM, Charles Oliver Nutter wrote:
>>>
>>> Yeah that's a pretty big problem :) Indy almost becomes a nonstarter if
>>> these degraded cases are not made a lot better. The performance on this
>>> example becomes terrible, and it's common for Ruby methods to be larger and
>>> more complex than this, too.
>>>
>>> Yeah, I know.
>>>
>>> You say you don't know how to fix it, so perhaps we can brainstorm a bit?
>>> If we can't inline the call we ate least want it to perform like an
>>> uninlined method call. Could we still compile the MH chain and CALL it, so
>>> at least we're not doing c2i? Or perhaps we should always treat the MH chain
>>> as inlinable regardless of budgets, but not inline the eventual call when
>>> budgets are exceeded? The latter option sounds more correct to me; the MH
>>> chain should be considered a non-removable part of the invokedynamic
>>> operation and always inlined. That would avoid the degradation without
>>> blowing up code size.
>>>
>>> I didn't say I don't know how to fix it but rather what's the best
>>> approach.  Internally we (Tom, John and I) already talked about this and I
>>> tried to do something what you suggest above:  compile the MH chain and call
>>> it.  It turned out it's not that easy (as I thought) and needs some special
>>> handling.
>>> An interesting idea is to always inline the "adaption" code of the MH
>>> chain but not the eventual call (or all calls in the chain).  I would guess
>>> that for normal usage the added code size is insignificant but there can
>>> definitely be pathological cases.  I try to get some data on that.
>>> -- Chris
>>>
>>> -  Charlie (mobile)
>>>
>>> On Oct 17, 2011 4:58 AM, "Christian Thalinger"
>>> <christian.thalin...@oracle.com> wrote:
>>>>
>>>> On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:
>>>>
>>>> > I'm seeing something peculiar and wanted to run it by you folks.
>>>> >
>>>> > There are a few values that JRuby's compiler had previously been
>>>> > loading from instance fields every time they're needed. Specifically,
>>>> > fields like ThreadContext.runtime (the current JRuby runtime),
>>>> > Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
>>>> > nil values). I figured I'd make a quick change today and have those
>>>> > instead be constant method handles bound into a mutable call site.
>>>> >
>>>> > Unfortunately, performance seems to be worse.
>>>> >
>>>> > The logic works like this:
>>>> >
>>>> > * ThreadContext is loaded to stack
>>>> > * invokedynamic, bootstrap just wires up an initialization method into
>>>> > a MutableCallSite
>>>> > * initialization method rebinds call site forever to a constant method
>>>> > handle pointing at the value (runtime/true/false/nil objects)
>>>> >
>>>> > My expectation was that this would be at least no slower (and
>>>> > potentially a tiny bit faster) but also less bytecode (in the case of
>>>> > true/false/nil, it was previously doing
>>>> > ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
>>>> > it's actually slower than walking those references, though, and I'm
>>>> > not sure why.
>>>> >
>>>> > Here's a couple of the scenarios in diff form showing bytecode before
>>>> > and bytecode after:
>>>> >
>>>> > Loading "runtime"
>>>> >
>>>> >     ALOAD 1
>>>> > -    GETFIELD org/jruby/runtime/ThreadContext.runtime :
>>>> > Lorg/jruby/Ruby;
>>>> > +    INVOKEDYNAMIC getRuntime
>>>> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
>>>> >
>>>> > [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/St
>>>> > ring;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>>>> >
>>>> > Loading "false"
>>>> >
>>>> >     ALOAD 1
>>>> > -    GETFIELD org/jruby/runtime/ThreadContext.runtime :
>>>> > Lorg/jruby/Ruby;
>>>> > -    INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
>>>> > +    INVOKEDYNAMIC getFalse
>>>> > (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
>>>> >
>>>> > [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite;
>>>> > (6)]
>>>> >
>>>> > I think because these are now seen as invocations, I'm hitting some
>>>> > inlining budget limit I didn't hit before (and which isn't being
>>>> > properly discounted). The benchmark I'm seeing degrade is
>>>> > bench/language/bench_flip.rb, and it's a pretty significant
>>>> > degradation. Only the "heap" version shows the degradation, and it
>>>> > definitely does have more bytecode...but the bytecode with my patch
>>>> > differs only in the way these values are being accessed, as shown in
>>>> > the diffs above.
>>>> >
>>>> > Before:
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             0.951000   0.000000
>>>> > 0.951000 (  0.910000)
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             0.705000   0.000000
>>>> > 0.705000 (  0.705000)
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             0.688000   0.000000
>>>> > 0.688000 (  0.688000)
>>>> >                                              user     system
>>>> > total        real
>>>> >
>>>> > After:
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             2.350000   0.000000
>>>> > 2.350000 (  2.284000)
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             2.128000   0.000000
>>>> > 2.128000 (  2.128000)
>>>> >                                              user     system
>>>> > total        real
>>>> > 1m x10 while (a)..(!a) (heap)             2.115000   0.000000
>>>> > 2.115000 (  2.116000)
>>>> >                                              user     system
>>>> > total        real
>>>> >
>>>> > You can see the degradation is pretty bad.
>>>> >
>>>> > I'm concerned because I had hoped that invokedynamic + mutable call
>>>> > site + constant handle would always be faster than a field
>>>> > access...since it avoids excessive field accesses and makes it
>>>> > possible for Hotspot to fold those constants away. What's going on
>>>> > here?
>>>>
>>>> I looked into this and the main issue here is an old friend:  slow
>>>> invokes of non-inlined MH call sites.  The problem is that you trade a
>>>> normal invoke (to a field load?) with a MH invoke.  If the normal invoke
>>>> doesn't get inlined we're good but if the MH invoke doesn't get inlined
>>>> we're screwed (since we are still doing the C2I-I2C dance).
>>>>
>>>> I refactored the benchmark a little (moved stack and heap loops into its
>>>> own methods and only do 5 while-loops instead of 11; that inlines all calls
>>>> in that method) and the performance is like you had expected (a little
>>>> faster):
>>>>
>>>> 32-bit:
>>>>
>>>> before:
>>>>
>>>> 1m x10 while (a)..(!a) (stack)            0.214000   0.000000   0.214000
>>>> (  0.214000)
>>>> 1m x10 while (a)..(!a) (heap)             0.249000   0.000000   0.249000
>>>> (  0.250000)
>>>>
>>>> after:
>>>>
>>>> 1m x10 while (a)..(!a) (stack)            0.203000   0.000000   0.203000
>>>> (  0.203000)
>>>> 1m x10 while (a)..(!a) (heap)             0.234000   0.000000   0.234000
>>>> (  0.234000)
>>>>
>>>> 64-bit:
>>>>
>>>> before:
>>>>
>>>> 1m x10 while (a)..(!a) (stack)            0.248000   0.000000   0.248000
>>>> (  0.248000)
>>>> 1m x10 while (a)..(!a) (heap)             0.257000   0.000000   0.257000
>>>> (  0.257000)
>>>>
>>>> after:
>>>>
>>>> 1m x10 while (a)..(!a) (stack)            0.226000   0.000000   0.226000
>>>> (  0.226000)
>>>> 1m x10 while (a)..(!a) (heap)             0.244000   0.000000   0.244000
>>>> (  0.244000)
>>>>
>>>> We have to fix that but I'm not sure yet what's the best approach.
>>>>  Sorry I don't have better news for now.
>>>>
>>>> -- Chris
>>>>
>>>> >
>>>> > Patch for the change (apply to JRuby master) is here:
>>>> > https://gist.github.com/955976b52b0c4e3f611e
>>>> >
>>>> > - Charlie
>>>> > _______________________________________________
>>>> > mlvm-dev mailing list
>>>> > mlvm-dev@openjdk.java.net
>>>> > http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>>
>>>> _______________________________________________
>>>> mlvm-dev mailing list
>>>> mlvm-dev@openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
>>> _______________________________________________
>>> mlvm-dev mailing list
>>> mlvm-dev@openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
>>>
>>> _______________________________________________
>>> mlvm-dev mailing list
>>> mlvm-dev@openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>>>
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev@openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>
>
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev@openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
>
>
_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev

Re: MutableCallSite + constant handle slower than field accesses?

Reply via email to