Now for something completely different: SwitchPoint-based "constant" lookup in JRuby.
It's certainly possible I'm doing something wrong here, but using a SwitchPoint for constant invalidation in JRuby (rather than pinging a global serial number) is significantly slower.

Using SwitchPoint:

~/projects/jruby ➔ jruby -J-d64 --server bench/language/bench_const_lookup.rb 10
100k * 100 nested const get     1.342000   0.000000   1.342000 (  1.286000)
100k * 100 nested const get     1.030000   0.000000   1.030000 (  1.030000)
100k * 100 nested const get     1.131000   0.000000   1.131000 (  1.131000)
100k * 100 nested const get     1.085000   0.000000   1.085000 (  1.085000)
100k * 100 nested const get     1.019000   0.000000   1.019000 (  1.019000)
100k * 100 inherited const get  1.230000   0.000000   1.230000 (  1.230000)
100k * 100 inherited const get  0.989000   0.000000   0.989000 (  0.989000)
100k * 100 inherited const get  0.981000   0.000000   0.981000 (  0.981000)
100k * 100 inherited const get  0.988000   0.000000   0.988000 (  0.988000)
100k * 100 inherited const get  1.025000   0.000000   1.025000 (  1.025000)
100k * 100 both                 1.206000   0.000000   1.206000 (  1.206000)
100k * 100 both                 0.992000   0.000000   0.992000 (  0.992000)
100k * 100 both                 0.989000   0.000000   0.989000 (  0.989000)
100k * 100 both                 1.000000   0.000000   1.000000 (  1.000000)
100k * 100 both                 1.003000   0.000000   1.003000 (  1.003000)

Using a global serial number ping:

100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.088000   0.000000   0.088000 (  0.087000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 nested const get     0.082000   0.000000   0.082000 (  0.082000)
100k * 100 inherited const get  0.084000   0.000000   0.084000 (  0.084000)
100k * 100 inherited const get  0.085000   0.000000   0.085000 (  0.085000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 inherited const get  0.083000   0.000000   0.083000 (  0.083000)
100k * 100 both                 0.096000   0.000000   0.096000 (  0.096000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.105000   0.000000   0.105000 (  0.105000)
100k * 100 both                 0.097000   0.000000   0.097000 (  0.097000)
100k * 100 both                 0.086000   0.000000   0.086000 (  0.086000)

Perhaps SwitchPoint has not had optimization love yet? FWIW, SwitchPoint doesn't even work in the macosx 5/13 build (which I *think* is b141), so there's nothing to compare it to (i.e. I don't consider this a regression...just slow).

I can investigate this further on demand.

- Charlie

On Thu, May 26, 2011 at 1:34 AM, Charles Oliver Nutter
<head...@headius.com> wrote:
> Ok, here we go with the macosx build from 5/13. Performance is
> *substantially* better.
>
> First, tak:
>
>       user     system      total        real
>   1.401000   0.000000   1.401000 (  0.821000)
>   0.552000   0.000000   0.552000 (  0.552000)
>   0.561000   0.000000   0.561000 (  0.561000)
>   0.552000   0.000000   0.552000 (  0.552000)
>   0.553000   0.000000   0.553000 (  0.553000)
>
> Same JRuby logic, earlier build, 2-4x faster than current MLVM invokedynamic.
>
> Now fib:
>
> 9227465
>   0.979000   0.000000   0.979000 (  0.922000)
> 9227465
>   0.848000   0.000000   0.848000 (  0.848000)
> 9227465
>   0.796000   0.000000   0.796000 (  0.796000)
> 9227465
>   0.792000   0.000000   0.792000 (  0.792000)
> 9227465
>   0.786000   0.000000   0.786000 (  0.787000)
>
> The margin is not as great here, but it's easily 20% faster than even
> the reverted GWT (no idea about the new GWT logic yet).
>
> I can provide assembly dumps and other logs from both builds on
> request. Where shall we start?
>
> Disclaimer: I know optimizing for simple cases like fib and tak is not
> a great idea, but it seems like if we can't make them fast we're going
> to have trouble with a lot of other stuff. I will endeavor to get
> numbers for less synthetic benchmarks too.
>
> - Charlie
>
> On Thu, May 26, 2011 at 12:33 AM, Charles Oliver Nutter
> <head...@headius.com> wrote:
>> Ok, onward with perf exploration, folks!
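For context on the two invalidation strategies compared at the top of this post, here is a minimal standalone sketch of each. This is not JRuby's actual call-site code; the class, method names, and constant values are invented for illustration. A SwitchPoint fronts the cached constant with a JIT-friendly guard that, once invalidated, permanently reroutes to a slow path; the serial-number approach instead compares a cached serial against a global counter on every lookup.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class ConstGuardSketch {
    // Hypothetical cached constant value and a stand-in for re-resolution.
    static Object fastPath() { return "FOO"; }
    static Object slowPath() { return "FOO(resolved)"; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType type = MethodType.methodType(Object.class);
        MethodHandle fast = lookup.findStatic(ConstGuardSketch.class, "fastPath", type);
        MethodHandle slow = lookup.findStatic(ConstGuardSketch.class, "slowPath", type);

        // Strategy 1: SwitchPoint guard. While the SwitchPoint is valid, calls
        // go straight to the cached fast path; invalidating it (e.g. on a
        // constant redefinition) permanently reroutes calls to the slow path.
        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(fast, slow);
        System.out.println(guarded.invoke());   // FOO
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });
        System.out.println(guarded.invoke());   // FOO(resolved)

        // Strategy 2: global serial number ping. Every lookup compares a
        // serial captured at cache-fill time against the runtime's current
        // serial, which is bumped whenever any constant changes.
        long globalSerial = 0L;             // bumped on constant redefinition
        long cachedSerial = globalSerial;   // captured when the cache was filled
        Object value = (cachedSerial == globalSerial) ? fastPath() : slowPath();
        System.out.println(value);          // FOO (cache still valid)
    }
}
```

The intent of the SwitchPoint variant is that the JIT can speculate on validity and compile the guard away entirely, which makes the slowdown above surprising.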
>>
>> I'm running with mostly-current MLVM, with John's temporary reversion
>> of GWT to the older non-ricochet logic.
>>
>> As reported before, "fib" has improved with the reversion, but it's
>> only marginally faster than JRuby's inline caching logic and easily
>> 30-40% slower than it was in builds from earlier this month.
>>
>> I also decided to run "tak", which is another dispatch and
>> recursion-heavy benchmark. This still seems to have a perf
>> degradation.
>>
>> Here's with standard settings, current MLVM, amd64:
>>
>> ~/projects/jruby ➔ jruby --server bench/bench_tak.rb 5
>>       user     system      total        real
>>   2.443000   0.000000   2.443000 (  2.383000)
>>   1.985000   0.000000   1.985000 (  1.985000)
>>   2.007000   0.000000   2.007000 (  2.007000)
>>   1.987000   0.000000   1.987000 (  1.987000)
>>   1.991000   0.000000   1.991000 (  1.991000)
>>
>> Here is with JRuby's inline caching. Given that tak is an arity three
>> method, it's likely that the usually megamorphic inline cache is still
>> monomorphic, so things are inlining through it when they wouldn't
>> normally:
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> bench/bench_tak.rb 5
>>       user     system      total        real
>>   1.565000   0.000000   1.565000 (  1.510000)
>>   0.624000   0.000000   0.624000 (  0.624000)
>>   0.624000   0.000000   0.624000 (  0.624000)
>>   0.624000   0.000000   0.624000 (  0.624000)
>>   0.632000   0.000000   0.632000 (  0.632000)
>>
>> Oddly enough, modifying the benchmark to guarantee there's at least
>> three different method calls of arity 3 does not appear to degrade
>> this benchmark...
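The monomorphic inline cache described in the quoted text above is typically built as a guardWithTest: a cheap receiver-class check in front of a cached target, falling back to full lookup on a miss. A minimal standalone sketch follows; the class and method names are invented for illustration and this is not JRuby's CachingCallSite implementation.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class InlineCacheSketch {
    // Hypothetical cached target: what the site resolved to on first call.
    static String target(String s) { return s.toUpperCase(); }

    // Hypothetical fallback: stands in for a full method lookup on cache miss.
    static Object fallback(Object o) { return "cache miss: " + o; }

    // Guard: is the receiver still an instance of the cached class?
    static boolean classGuard(Class<?> expected, Object receiver) {
        return receiver.getClass() == expected;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        MethodHandle guard = lookup.findStatic(InlineCacheSketch.class, "classGuard",
                MethodType.methodType(boolean.class, Class.class, Object.class))
                .bindTo(String.class);   // cache the expected receiver class

        MethodHandle fast = lookup.findStatic(InlineCacheSketch.class, "target",
                MethodType.methodType(String.class, String.class))
                .asType(MethodType.methodType(Object.class, Object.class));

        MethodHandle slow = lookup.findStatic(InlineCacheSketch.class, "fallback",
                MethodType.methodType(Object.class, Object.class));

        // guardWithTest: cheap class check in front of the cached target.
        MethodHandle site = MethodHandles.guardWithTest(guard, fast, slow);

        System.out.println(site.invoke("tak"));   // hits the cache: TAK
        System.out.println(site.invoke(42));      // misses: cache miss: 42
    }
}
```

A site that only ever sees one receiver class (as with tak) keeps passing the guard, which is what lets the JIT inline straight through it.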
>>
>> Moving on to dynopt (reminder: this emits two invocations at compile
>> time, one a guarded invokevirtual or invokestatic and the other a
>> normal CachingCallSite.call):
>>
>> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=false
>> -Xcompile.dynopt=true bench/bench_tak.rb 5
>>       user     system      total        real
>>   0.703000   0.000000   0.703000 (  0.630000)
>>   0.514000   0.000000   0.514000 (  0.514000)
>>   0.511000   0.000000   0.511000 (  0.511000)
>>   0.512000   0.000000   0.512000 (  0.512000)
>>   0.510000   0.000000   0.510000 (  0.510000)
>>
>> This is the "ideal" for invokedynamic, which hopefully should inline
>> as well as this guarded direct invocation (right?).
>>
>> Now, it gets a bit more interesting. If I turn recursive inlining down
>> to zero and use invokedynamic:
>>
>> ~/projects/jruby ➔ jruby --server -J-XX:MaxRecursiveInlineLevel=0
>> bench/bench_tak.rb 5
>>       user     system      total        real
>>   1.010000   0.000000   1.010000 (  0.954000)
>>   0.869000   0.000000   0.869000 (  0.869000)
>>   0.870000   0.000000   0.870000 (  0.870000)
>>   0.869000   0.000000   0.869000 (  0.869000)
>>   0.870000   0.000000   0.870000 (  0.870000)
>>
>> Performance is easily 2x what it is with stock inlining settings.
>> Something about invokedynamic or the MH chain is changing the
>> characteristics of inlining in a way different from dynopt.
>>
>> So what looks interesting here? For which combination would you be
>> interested in seeing logs?
>>
>> FWIW, I am pulling earlier builds now to try out fib and tak and get
>> assembly output from them.
>>
>> - Charlie
>>

_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev