On Oct 15, 2011, at 2:56 PM, Charles Oliver Nutter wrote:

> I'm seeing something peculiar and wanted to run it by you folks.
>
> There are a few values that JRuby's compiler had previously been
> loading from instance fields every time they're needed. Specifically,
> fields like ThreadContext.runtime (the current JRuby runtime),
> Ruby.falseObject, Ruby.trueObject, Ruby.nilObject (false, true, and
> nil values). I figured I'd make a quick change today and have those
> instead be constant method handles bound into a mutable call site.
>
> Unfortunately, performance seems to be worse.
>
> The logic works like this:
>
> * ThreadContext is loaded to stack
> * invokedynamic, bootstrap just wires up an initialization method into
>   a MutableCallSite
> * initialization method rebinds call site forever to a constant method
>   handle pointing at the value (runtime/true/false/nil objects)
>
> My expectation was that this would be at least no slower (and
> potentially a tiny bit faster) but also less bytecode (in the case of
> true/false/nil, it was previously doing
> ThreadContext.runtime.getNil()/getTrue()/getFalse()). It seems like
> it's actually slower than walking those references, though, and I'm
> not sure why.
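
For readers following along, the three steps listed above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not JRuby's actual InvokeDynamicSupport code: RubyRuntime and Context are hypothetical stand-ins for org.jruby.Ruby and ThreadContext, the helpers newSite() and load() are invented for the example, and load() goes through dynamicInvoker() because Java source cannot emit an invokedynamic instruction directly.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

public class ConstantSiteSketch {
    /** Hypothetical stand-in for org.jruby.Ruby. */
    public static final class RubyRuntime {}

    /** Hypothetical stand-in for org.jruby.runtime.ThreadContext. */
    public static final class Context {
        public final RubyRuntime runtime;
        public Context(RubyRuntime runtime) { this.runtime = runtime; }
    }

    private static final MethodHandles.Lookup LOOKUP = MethodHandles.lookup();

    /** Bootstrap: wires the one-shot initializer into a MutableCallSite. */
    public static CallSite bootstrap(MethodHandles.Lookup caller, String name, MethodType type)
            throws ReflectiveOperationException {
        MutableCallSite site = new MutableCallSite(type);
        MethodHandle init = LOOKUP.findStatic(ConstantSiteSketch.class, "init",
                MethodType.methodType(RubyRuntime.class, MutableCallSite.class, Context.class));
        // Bind the site itself as the first argument so init can rebind it later.
        site.setTarget(init.bindTo(site).asType(type));
        return site;
    }

    /** First call: read the field once, then rebind the site to a constant handle. */
    public static RubyRuntime init(MutableCallSite site, Context context) {
        RubyRuntime value = context.runtime;
        MethodHandle constant = MethodHandles.constant(RubyRuntime.class, value);
        // The site still receives the context argument, so accept and ignore it.
        site.setTarget(MethodHandles.dropArguments(constant, 0, Context.class));
        return value;
    }

    /** Hypothetical helper: creates a site of type (Context)RubyRuntime. */
    public static CallSite newSite() {
        try {
            return bootstrap(LOOKUP, "getRuntime",
                    MethodType.methodType(RubyRuntime.class, Context.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical helper: simulates executing the invokedynamic instruction. */
    public static RubyRuntime load(CallSite site, Context context) {
        try {
            return (RubyRuntime) site.dynamicInvoker().invokeExact(context);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        Context context = new Context(new RubyRuntime());
        CallSite site = newSite();
        System.out.println(load(site, context) == context.runtime); // initializer path
        System.out.println(load(site, context) == context.runtime); // constant path
    }
}
```

Once the site has been rebound, it ignores whatever context is passed and always returns the value captured on the first call. That is what should let the JIT fold the value as a constant, provided the call site itself gets inlined.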
>
> Here's a couple of the scenarios in diff form showing bytecode before
> and bytecode after:
>
> Loading "runtime"
>
>   ALOAD 1
> - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> + INVOKEDYNAMIC getRuntime
>   (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/Ruby;
>   [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>
> Loading "false"
>
>   ALOAD 1
> - GETFIELD org/jruby/runtime/ThreadContext.runtime : Lorg/jruby/Ruby;
> - INVOKEVIRTUAL org/jruby/Ruby.getFalse ()Lorg/jruby/RubyBoolean;
> + INVOKEDYNAMIC getFalse
>   (Lorg/jruby/runtime/ThreadContext;)Lorg/jruby/RubyBoolean;
>   [org/jruby/runtime/invokedynamic/InvokeDynamicSupport.getObjectBootstrap(Ljava/lang/invoke/MethodHandles$Lookup;Ljava/lang/String;Ljava/lang/invoke/MethodType;)Ljava/lang/invoke/CallSite; (6)]
>
> I think because these are now seen as invocations, I'm hitting some
> inlining budget limit I didn't hit before (and which isn't being
> properly discounted). The benchmark I'm seeing degrade is
> bench/language/bench_flip.rb, and it's a pretty significant
> degradation. Only the "heap" version shows the degradation, and it
> definitely does have more bytecode...but the bytecode with my patch
> differs only in the way these values are being accessed, as shown in
> the diffs above.
>
> Before:
>                                      user     system      total        real
> 1m x10 while (a)..(!a) (heap)    0.951000   0.000000   0.951000 (  0.910000)
> 1m x10 while (a)..(!a) (heap)    0.705000   0.000000   0.705000 (  0.705000)
> 1m x10 while (a)..(!a) (heap)    0.688000   0.000000   0.688000 (  0.688000)
>
> After:
>                                      user     system      total        real
> 1m x10 while (a)..(!a) (heap)    2.350000   0.000000   2.350000 (  2.284000)
> 1m x10 while (a)..(!a) (heap)    2.128000   0.000000   2.128000 (  2.128000)
> 1m x10 while (a)..(!a) (heap)    2.115000   0.000000   2.115000 (  2.116000)
>
> You can see the degradation is pretty bad.
>
> I'm concerned because I had hoped that invokedynamic + mutable call
> site + constant handle would always be faster than a field
> access...since it avoids excessive field accesses and makes it
> possible for Hotspot to fold those constants away. What's going on
> here?

I looked into this, and the main issue here is an old friend: slow
invokes of non-inlined MH call sites.

The problem is that you trade a normal invoke (to a field load?) for an
MH invoke. If the normal invoke doesn't get inlined we're good, but if
the MH invoke doesn't get inlined we're screwed (since we are still
doing the C2I-I2C dance).

I refactored the benchmark a little (moved the stack and heap loops into
their own methods and only do 5 while-loops instead of 11; that inlines
all calls in that method) and the performance is like you had expected
(a little faster):

32-bit:

before:
1m x10 while (a)..(!a) (stack)   0.214000   0.000000   0.214000 (  0.214000)
1m x10 while (a)..(!a) (heap)    0.249000   0.000000   0.249000 (  0.250000)

after:
1m x10 while (a)..(!a) (stack)   0.203000   0.000000   0.203000 (  0.203000)
1m x10 while (a)..(!a) (heap)    0.234000   0.000000   0.234000 (  0.234000)

64-bit:

before:
1m x10 while (a)..(!a) (stack)   0.248000   0.000000   0.248000 (  0.248000)
1m x10 while (a)..(!a) (heap)    0.257000   0.000000   0.257000 (  0.257000)

after:
1m x10 while (a)..(!a) (stack)   0.226000   0.000000   0.226000 (  0.226000)
1m x10 while (a)..(!a) (heap)    0.244000   0.000000   0.244000 (  0.244000)

We have to fix that, but I'm not sure yet what the best approach is.
Sorry I don't have better news for now.

-- Chris

>
> Patch for the change (apply to JRuby master) is here:
> https://gist.github.com/955976b52b0c4e3f611e
>
> - Charlie
> _______________________________________________
> mlvm-dev mailing list
> mlvm-dev@openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev