Oh, there's also this peculiar effect... shouldn't -XX:-TieredCompilation just give me C2 alone?
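(A quick way to double-check that the flag is actually making it through JRuby's -J passthrough to the JVM, just as a sketch, since the flag listing format varies by build:

jruby -J-XX:-TieredCompilation -J-XX:+PrintFlagsFinal -e 1 | grep TieredCompilation

If that reports TieredCompilation = false, the flag reached the VM.)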
system ~/projects/jruby $ jruby -v -J-XX:-TieredCompilation ../rubybench/bench/time/bench_red_black.rb
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
9.191
1.923
1.429
1.183
1.226
1.237
1.211
1.284
1.267
1.223

system ~/projects/jruby $ jruby -v ../rubybench/bench/time/bench_red_black.rb
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
4.58
1.421
0.912
0.922
0.835
0.83
0.891
0.816
0.825
0.853

And here are those Java 7 numbers. I guess it's not as close as what I posted previously, but it's still a lot better:

system ~/projects/jruby $ (pickjdk 5; jruby -v -Xcompile.invokedynamic=true ../rubybench/bench/time/bench_red_black.rb )
New JDK: jdk1.7.0_09.jdk
jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on Java HotSpot(TM) 64-Bit Server VM 1.7.0_09-b05 +indy [darwin-x86_64]
3.105
1.595
1.182
0.825
1.751
0.794
0.756
0.746
0.702
0.777

- Charlie

On Sun, Dec 23, 2012 at 11:56 PM, Charles Oliver Nutter
<head...@headius.com> wrote:
> Ok, things are definitely looking up with Roland's and Christian's patches!
>
> Numbers for red/black get as low as 0.74s with the new logic, instead
> of the 1.5s I get without the patches, and compared to the old logic's
> best time of 0.726s. Both results are rather variable (maybe as much as
> 15%) due to the amount of allocation and GC happening. So it's not
> quite at the level of the old logic, but it's darn close.
>
> However, here's a benchmark that's still considerably slower than on
> the Java 7 impl: https://gist.github.com/4367878
>
> This requires the "perfer" gem (gem install perfer), and the "static"
> and "included" versions should perform at the same level. The overall
> loop should be a lot faster too.
>
> Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
> + Christian's patch:
>
> system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true ../jruby/static_versus_include_bench.rb
> Session Static versus included method invocation with jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
> Taking 10 measurements of at least 1.0s
> control loop         10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
> static invocation    17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
> included invocation  11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips
>
> The static case (Foo.foo) basically boils down to a SwitchPoint +
> cached value for Foo, and then SwitchPoint + GWT + field read +
> reference comparison for the call. The included case is just the
> latter, so this seems to indicate that the SwitchPoint for the Foo
> lookup is adding more overhead than it should. I have not dug any
> deeper, so I'm tossing this out there.
>
> Will try to get some logging for the benchmark tomorrow.
>
> - Charlie
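For anyone who wants to poke at that static-vs-included difference outside of JRuby, here's a minimal standalone sketch of the two call-site shapes described in the quoted message. Every name in it (IndyShapeSketch, RubyModule, fooBody, relink, lookupFoo) is made up for illustration and leaves out all of JRuby's real plumbing; the point is only that the "static" path stacks a second SwitchPoint-guarded constant in front of the same guarded call:

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class IndyShapeSketch {
    // Stand-in for a JRuby runtime type; every name here is hypothetical.
    static class RubyModule {}

    static final RubyModule FOO_CLASS = new RubyModule();

    // Cached method body for #foo.
    static String fooBody(RubyModule self) { return "foo"; }

    // Slow path: a real site would re-lookup the method and re-link here.
    static String relink(RubyModule self) { return "relinked"; }

    // Fallback for the constant lookup: a real site would re-resolve ::Foo.
    static RubyModule lookupFoo() { return FOO_CLASS; }

    // Reference comparison against the cached class.
    static boolean classMatches(RubyModule cached, RubyModule actual) { return cached == actual; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType callType = MethodType.methodType(String.class, RubyModule.class);

        MethodHandle body = lookup.findStatic(IndyShapeSketch.class, "fooBody", callType);
        MethodHandle slow = lookup.findStatic(IndyShapeSketch.class, "relink", callType);
        MethodHandle test = lookup.findStatic(IndyShapeSketch.class, "classMatches",
                MethodType.methodType(boolean.class, RubyModule.class, RubyModule.class))
                .bindTo(FOO_CLASS);

        // "Included" shape: SwitchPoint + GWT + reference comparison, then the body.
        SwitchPoint methodSP = new SwitchPoint();
        MethodHandle includedCall =
                methodSP.guardWithTest(MethodHandles.guardWithTest(test, body, slow), slow);

        // "Static" shape adds a SwitchPoint-guarded cached value for the Foo constant itself.
        SwitchPoint constantSP = new SwitchPoint();
        MethodHandle fooConstant = constantSP.guardWithTest(
                MethodHandles.constant(RubyModule.class, FOO_CLASS),
                lookup.findStatic(IndyShapeSketch.class, "lookupFoo",
                        MethodType.methodType(RubyModule.class)));

        // Foo.foo: constant lookup, then the guarded call -- two SwitchPoints deep.
        RubyModule foo = (RubyModule) fooConstant.invokeExact();
        System.out.println((String) includedCall.invokeExact(foo));
    }
}

Wrapping the two invokeExact calls in timing loops gives a rough, JRuby-free approximation of what perfer is comparing in the gist, minus the real dispatch logic.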
> On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
> <head...@headius.com> wrote:
>> Excellent! I'll give it a look and base my experiments on that!
>>
>> - Charlie
>>
>> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
>> <vladimir.koz...@oracle.com> wrote:
>>> Hi Charlie,
>>>
>>> If you want to experiment :) you can try the code Roland and Christian
>>> pushed.
>>>
>>> Roland just pushed incremental inlining changes for C2, which should
>>> help LF inlining:
>>>
>>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>>
>>> You also need Christian's inlining-related changes on the JDK side,
>>> which are here:
>>>
>>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>>
>>> Regards,
>>> Vladimir
>>>
>>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>>> A thread emerges!
>>>>
>>>> I'm going to be taking some time this holiday to explore the
>>>> performance of the new LF indy impl in various situations. This will
>>>> be the thread where I gather observations.
>>>>
>>>> A couple preliminaries...
>>>>
>>>> My perf exploration so far seems to show LF performing nearly
>>>> equivalently to the old impl for the smallest benchmarks, with
>>>> performance rapidly degrading as the size of the code involved grows.
>>>> Recursive fib and tak have nearly identical perf on LF and the old
>>>> impl. Red/black performs about the same on LF as with indy disabled,
>>>> well behind the old indy performance. At some point, LF falls
>>>> completely off the cliff and can't even compete with non-indy logic,
>>>> as in a benchmark I ran today of Ruby constant access (heavily
>>>> SwitchPoint-dependent).
>>>>
>>>> Discussions with Christian seem to indicate that the fall-off is
>>>> because non-inlined LF indy call sites perform very poorly compared to
>>>> the old impl. I'll be trying to explore this and correlate the perf
>>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>>> work on incremental inlining will help reduce the performance impact
>>>> of the fall-off, but I'm not sure of the status of that work.
>>>>
>>>> Some early ASM output from a trivial benchmark: loop 500M times
>>>> calling #foo, which immediately calls #bar, which just returns the
>>>> self object (ALOAD 2; ARETURN, in essence). I've been comparing the
>>>> new ASM to the old, both presented in a gist here:
>>>> https://gist.github.com/4365103
>>>>
>>>> As you can see, the code resulting from both impls boils down to
>>>> almost nothing, but there's one difference...
>>>>
>>>> New code not present in old:
>>>>
>>>> 0x0000000111ab27ef: je 0x0000000111ab2835  ;*ifnull
>>>>                     ; - java.lang.Class::cast@1 (line 3007)
>>>>                     ; - java.lang.invoke.LambdaForm$MH/763053631::guard@12
>>>>                     ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite@14
>>>>                     ; - ruby.__dash_e__::method__0$RUBY$foo@3 (line 1)
>>>>
>>>> A side effect of inlining through LFs, I presume? Checking to ensure a
>>>> non-null call site? If so, shouldn't this have folded away, since the
>>>> call site is constant?
>>>>
>>>> In any case, it's hardly damning to have an extra branch. This output
>>>> is, at least, proof that LF *can* inline and optimize as well as the
>>>> old impl... so we can put that aside for now. The questions to explore
>>>> then are:
>>>>
>>>> * Do cases expected to inline actually do so under the LF impl?
>>>> * When inlining, does code optimize as it should (across the various
>>>>   shapes of call sites in JRuby, at least)?
>>>> * When code does not inline, how does it impact performance?
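(Regarding the first two questions above: one fairly direct way to check on a small benchmark is to turn on the inlining diagnostics, roughly along these lines; the exact flags and output format vary a bit between builds:

jruby -Xcompile.invokedynamic=true -J-XX:+UnlockDiagnosticVMOptions -J-XX:+PrintInlining -J-XX:+PrintCompilation ../rubybench/bench/time/bench_red_black.rb

and then look for the Ruby method names and the LambdaForm$MH frames in the inlining trees to see which call sites C2 actually inlined and where it gave up.)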
>>>> My expectation is that cases which should inline do so under LF, but
>>>> that the non-inlined performance is significantly worse than under the
>>>> old impl. The critical bit will be ensuring that even when LF call
>>>> sites do not inline, they at least still compile, to avoid
>>>> interpretation and LF-to-LF overhead. At a minimum, it seems like we
>>>> should be able to expect all LFs between a call site and its DMH
>>>> target to get compiled into a single unit, if not inlined into the
>>>> caller. I still contend that the call site + LFs should be heavily
>>>> prioritized for inlining either into the caller or along with the
>>>> called method, since they really *are* the shape of the call site. If
>>>> there has to be a callq somewhere in that chain, there should ideally
>>>> be only one.
>>>>
>>>> So... here we go.
>>>>
>>>> - Charlie
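One more knob that might help separate "the LF chain didn't inline" from "the LF chain never even compiled to bytecode": the LF implementation reads a java.lang.invoke.MethodHandle.COMPILE_THRESHOLD system property (assuming that property name still applies in this build, which is worth verifying). Something like

jruby -Xcompile.invokedynamic=true -J-Djava.lang.invoke.MethodHandle.COMPILE_THRESHOLD=0 -J-XX:+PrintCompilation ../rubybench/bench/time/bench_red_black.rb

should force LambdaForms to spin bytecode immediately rather than being interpreted, and grepping the PrintCompilation output for LambdaForm then shows whether they reach the JIT at all in the non-inlined case.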