On Dec 23, 2012, at 10:11 PM, Charles Oliver Nutter <head...@headius.com> wrote:
> Oh, there's also this peculiar effect...shouldn't -TieredCompilation
> just give me C2 alone?

Yes, it should.

>
> system ~/projects/jruby $ jruby -v -J-XX:-TieredCompilation
> ../rubybench/bench/time/bench_red_black.rb
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit
> Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy
> [darwin-x86_64]
> 9.191
> 1.923
> 1.429
> 1.183
> 1.226
> 1.237
> 1.211
> 1.284
> 1.267
> 1.223
>
> system ~/projects/jruby $ jruby -v ../rubybench/bench/time/bench_red_black.rb
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit
> Server VM 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy
> [darwin-x86_64]
> 4.58
> 1.421
> 0.912
> 0.922
> 0.835
> 0.83
> 0.891
> 0.816
> 0.825
> 0.853

The Nashorn people have seen similar results when using tiered. We
haven't investigated yet but I have the feeling that it's related to the
huge compile tasks that come out of LFs. Sometimes it's better to already
have compiled code for a method rather than inlining it. And with tiered
it seems that's what's happening. It could also be related to racing
compiles (tiered has more compiler threads and C1 compiles faster).

-- Chris

>
> And here's those Java 7 numbers. I guess it's not as close as what I
> posted previously, but it's still a lot better:
>
> system ~/projects/jruby $ (pickjdk 5; jruby -v
> -Xcompile.invokedynamic=true
> ../rubybench/bench/time/bench_red_black.rb )
> New JDK: jdk1.7.0_09.jdk
> jruby 1.7.2.dev (1.9.3p327) 2012-12-22 51cc3ad on Java HotSpot(TM)
> 64-Bit Server VM 1.7.0_09-b05 +indy [darwin-x86_64]
> 3.105
> 1.595
> 1.182
> 0.825
> 1.751
> 0.794
> 0.756
> 0.746
> 0.702
> 0.777
>
> - Charlie
>
> On Sun, Dec 23, 2012 at 11:56 PM, Charles Oliver Nutter
> <head...@headius.com> wrote:
>> Ok, things are definitely looking up with Roland's and Christian's patches!
>>
>> Numbers for red/black get as low as 0.74s with the new logic instead
>> of the 1.5s I get without the patches, and compared to the old logic's
>> best time of 0.726. Both results are rather variable (maybe as much as
>> 15%) due to the amount of allocation and GC happening. So it's not
>> quite at the level of the old logic, but it's darn close.
>>
>> However, here's a benchmark that's still considerably slower than on
>> the Java 7 impl: https://gist.github.com/4367878
>>
>> This requires the "perfer" gem (gem install perfer) and should be
>> level between the "static" and "included" versions. The overall loop
>> should be a lot faster too.
>>
>> Numbers for Java 7u9 are in the gist. Numbers for current hotspot-comp
>> + Christian's patch:
>>
>> system ~/projects/jruby $ jruby -Xcompile.invokedynamic=true
>> ../jruby/static_versus_include_bench.rb
>> Session Static versus included method invocation with jruby 1.7.2.dev
>> (1.9.3p327) 2012-12-22 51cc3ad on OpenJDK 64-Bit Server VM
>> 1.8.0-internal-headius_2012_12_23_22_29-b00 +indy [darwin-x86_64]
>> Taking 10 measurements of at least 1.0s
>> control loop 10.99 ns/i ± 1.304 (11.9%) <=> 90938318 ips
>> static invocation 17.65 ns/i ± 1.380 ( 7.8%) <=> 56658156 ips
>> included invocation 11.15 ns/i ± 3.132 (28.1%) <=> 89630324 ips
>>
>> The static case (Foo.foo) basically boils down to a SwitchPoint +
>> cached value for Foo and then SwitchPoint + GWT + field read +
>> reference comparison for the call. The included case is just the
>> latter, so this seems to indicate that the SwitchPoint for the Foo
>> lookup is adding more overhead than it should. I have not dug any
>> deeper, so I'm tossing this out there.
>>
>> Will try to get some logging for the benchmark tomorrow.
>>
>> - Charlie
>>
>> On Sun, Dec 23, 2012 at 10:26 PM, Charles Oliver Nutter
>> <head...@headius.com> wrote:
>>> Excellent! I'll give it a look and base my experiments on that!
>>>
>>> - Charlie
>>>
>>> On Sun, Dec 23, 2012 at 4:04 PM, Vladimir Kozlov
>>> <vladimir.koz...@oracle.com> wrote:
>>>> Hi Charlie,
>>>>
>>>> If you want to experiment :) you can try the code Roland and Christian
>>>> pushed.
>>>>
>>>> Roland just pushed Incremental inlining changes for C2 which should help
>>>> LF inlining:
>>>>
>>>> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/rev/d092d1b31229
>>>>
>>>> You also need Christian's inlining related changes in JDK which :
>>>>
>>>> http://hg.openjdk.java.net/hsx/hotspot-main/jdk/rev/12fa4d7ecaf5
>>>>
>>>> Regards,
>>>> Vladimir
>>>>
>>>> On 12/23/12 11:21 AM, Charles Oliver Nutter wrote:
>>>>> A thread emerges!
>>>>>
>>>>> I'm going to be taking some time this holiday to explore the
>>>>> performance of the new LF indy impl in various situations. This will
>>>>> be the thread where I gather observations.
>>>>>
>>>>> A couple preliminaries...
>>>>>
>>>>> My perf exploration so far seems to show LF performing nearly
>>>>> equivalent to the old impl for the smallest benchmarks, with
>>>>> performance rapidly degrading as the size of the code involved grows.
>>>>> Recursive fib and tak have nearly identical perf on LF and the old
>>>>> impl. Red/black performs about the same on LF as with indy disabled,
>>>>> well behind the old indy performance. At some point, LF falls
>>>>> completely off the cliff and can't even compete with non-indy logic,
>>>>> as in a benchmark I ran today of Ruby constant access (heavily
>>>>> SwitchPoint-dependent).
>>>>>
>>>>> Discussions with Christian seem to indicate that the fall-off is
>>>>> because non-inlined LF indy call sites perform very poorly compared to
>>>>> the old impl. I'll be trying to explore this and correlate the perf
>>>>> cliff with failure to inline. Christian has told me that (upcoming?)
>>>>> work on incremental inlining will help reduce the performance impact
>>>>> of the fall-off, but I'm not sure of the status of this work.
>>>>>
>>>>> Some early ASM output from a trivial benchmark: loop 500M times
>>>>> calling #foo, which immediately calls #bar, which just returns the
>>>>> self object (ALOAD 2; ARETURN in essence). I've been comparing the new
>>>>> ASM to the old, both presented in a gist here:
>>>>> https://gist.github.com/4365103
>>>>>
>>>>> As you can see, the code resulting from both impls boils down to
>>>>> almost nothing, but there's one difference...
>>>>>
>>>>> New code not present in old:
>>>>>
>>>>> 0x0000000111ab27ef: je 0x0000000111ab2835 ;*ifnull
>>>>> ; - java.lang.Class::cast@1 (line 3007)
>>>>> ; - java.lang.invoke.LambdaForm$MH/763053631::guard@12
>>>>> ; - java.lang.invoke.LambdaForm$MH/518216626::linkToCallSite@14
>>>>> ; - ruby.__dash_e__::method__0$RUBY$foo@3 (line 1)
>>>>>
>>>>> A side effect of inlining through LFs, I presume? Checking to ensure
>>>>> non-null call site? If so, shouldn't this have folded away, since the
>>>>> call site is constant?
>>>>>
>>>>> In any case, it's hardly damning to have an extra branch. This output
>>>>> is, at least, proof that LF *can* inline and optimize as well as the
>>>>> old impl...so we can put that aside for now. The questions to explore
>>>>> then are:
>>>>>
>>>>> * Do cases expected to inline actually do so under LF impl?
>>>>> * When inlining, does code optimize as it should (across the various
>>>>> shapes of call sites in JRuby, at least)?
>>>>> * When code does not inline, how does it impact performance?
>>>>>
>>>>> My expectation is that cases which should inline do so under LF, but
>>>>> that the non-inlined performance is significantly worse than under the
>>>>> old impl. The critical bit will be ensuring that even when LF call
>>>>> sites do not inline, they at least still compile to avoid
>>>>> interpretation and LF-to-LF overhead.
>>>>> At a minimum, it seems like we
>>>>> should be able to expect all LF between a call site and its DMH target
>>>>> will get compiled into a single unit, if not inlined into the caller.
>>>>> I still contend that call site + LFs should be heavily prioritized for
>>>>> inlining either into the caller or along with the called method, since
>>>>> they really *are* the shape of the call site. If there has to be a
>>>>> callq somewhere in that chain, there should ideally be only one.
>>>>>
>>>>> So...here we go.
>>>>>
>>>>> - Charlie
>>>>> _______________________________________________
>>>>> mlvm-dev mailing list
>>>>> mlvm-dev@openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
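
For anyone following along without the JRuby source handy: the "SwitchPoint + cached value" and "SwitchPoint + GWT" shapes mentioned in the thread can be approximated with the public java.lang.invoke API. The following is only a rough sketch under stated assumptions, not JRuby's actual call-site code. The class name and the helper methods (slowConstantLookup, sameType, cachedTarget, fullDispatch) are hypothetical, and a plain class check stands in for the "field read + reference comparison" guard.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from JRuby.
class SwitchPointSketch {
    // Stand-in for a full constant lookup after the cache is invalidated.
    static String slowConstantLookup() { return "Foo (slow lookup)"; }

    // Stand-in guard: a reference comparison against the expected type.
    static boolean sameType(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    // Stand-ins for the cached method body and the uncached dispatch path.
    static String cachedTarget(Object receiver) { return "fast:" + receiver; }
    static String fullDispatch(Object receiver) { return "slow:" + receiver; }

    static List<String> run() throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        Class<?> self = SwitchPointSketch.class;

        // Constant access: SwitchPoint + cached value.
        SwitchPoint constantValid = new SwitchPoint();
        MethodHandle constantGet = constantValid.guardWithTest(
            MethodHandles.constant(String.class, "Foo (cached)"),
            lookup.findStatic(self, "slowConstantLookup",
                MethodType.methodType(String.class)));

        // Method call: SwitchPoint + guardWithTest (GWT) around the cached target.
        MethodType callType = MethodType.methodType(String.class, Object.class);
        MethodHandle test = MethodHandles.insertArguments(
            lookup.findStatic(self, "sameType",
                MethodType.methodType(boolean.class, Class.class, Object.class)),
            0, String.class);
        MethodHandle fallback = lookup.findStatic(self, "fullDispatch", callType);
        MethodHandle gwt = MethodHandles.guardWithTest(
            test, lookup.findStatic(self, "cachedTarget", callType), fallback);
        SwitchPoint methodsValid = new SwitchPoint();
        MethodHandle call = methodsValid.guardWithTest(gwt, fallback);

        List<String> out = new ArrayList<>();
        out.add((String) constantGet.invoke());     // cached constant
        out.add((String) call.invoke((Object) "str")); // guard passes
        out.add((String) call.invoke((Object) 42));    // guard fails

        // Invalidation flips both handles to their fallbacks permanently.
        SwitchPoint.invalidateAll(new SwitchPoint[] { constantValid, methodsValid });
        out.add((String) constantGet.invoke());
        out.add((String) call.invoke((Object) "str"));
        return out;
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(run());
    }
}
```

In this sketch the "static" case corresponds to invoking constantGet and then call, while the "included" case corresponds to call alone, so any cost difference between the two would come from the extra SwitchPoint-guarded constant handle. Whether HotSpot folds either guard away depends on inlining, which is the open question in the thread.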