On Apr 28, 2011, at 3:56 PM, Charles Oliver Nutter wrote:

> On Thu, Apr 28, 2011 at 8:19 AM, Charles Oliver Nutter
> <head...@headius.com> wrote:
>> I've been trying to think of ways to reduce the guard cost, since the
>> perf without the JRuby guard is a fair bit better (0.79 versus 0.63s
>> for fib(35)). The performance without guards is actually faster than
>> any other Ruby implementation I've yet run. One idea:
>
> Now for a harder question...
>
> Any thoughts on how we can make this even faster? The bulk of the code
> seems to be taken up by a few operations inherent to Fixnum math:
>
> * Memory accesses relating to CallSite subclasses (LtCallSite and friends)
> * instanceof checks in those math-related CallSites
> * Fixnum overflow checks in + and - operations
> * Fixnum allocation/initialization costs (or Fixnum cache accesses)
>
> As it stands today, the overhead of Fixnum operations is the primary
> factor preventing us from writing a lot more of JRuby's code in Ruby.
> Fixnums are too expensive to use for iterating over an array, doing a
> loop, etc. Of course we could do some code analysis to try to reduce
> loops to simple int operations, but barring that...does anyone have
> suggestions for reducing the cost of actual Fixnum operations?
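
For readers unfamiliar with the "Fixnum overflow checks in + and -" item above: a common way to detect signed 64-bit overflow without widening is to compare the signs of the operands and the wrapped result with XOR. The Java sketch below assumes a long-backed fixnum that falls back to java.math.BigInteger (standing in for Bignum) on overflow; the class and method names are hypothetical and are not JRuby's RubyFixnum code.

```java
import java.math.BigInteger;

// Hypothetical helpers illustrating fixnum overflow checks; not JRuby's actual code.
public final class FixnumMath {
    private FixnumMath() {}

    /** True if a + b does not fit in a signed 64-bit long. */
    public static boolean additionOverflows(long a, long b) {
        long r = a + b; // Java long arithmetic wraps silently
        // Overflow iff a and b share a sign and r's sign differs from both.
        return ((a ^ r) & (b ^ r)) < 0;
    }

    /** True if a - b does not fit in a signed 64-bit long. */
    public static boolean subtractionOverflows(long a, long b) {
        long r = a - b;
        // Overflow iff a and b differ in sign and r's sign differs from a's.
        return ((a ^ b) & (a ^ r)) < 0;
    }

    /** Fast path stays in long; overflow promotes to BigInteger (standing in for Bignum). */
    public static Number minus(long a, long b) {
        if (subtractionOverflows(a, b)) {
            return BigInteger.valueOf(a).subtract(BigInteger.valueOf(b));
        }
        return a - b; // boxed to Long
    }

    public static void main(String[] args) {
        System.out.println(minus(10, 3));             // 7
        System.out.println(minus(Long.MIN_VALUE, 1)); // -9223372036854775809
    }
}
```

On Java 8 and later, Math.addExact and Math.subtractExact perform the same check but signal overflow by throwing ArithmeticException; returning a boolean instead lets the caller branch to the slow path without exception handling.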
Sorry, that's not my area :-)

>
> Also...is EA working with indy now?

No. EA is turned off at invokedynamic call sites.

> Unfortunately Fixnum construction
> does not fully inline at the moment, since there's too many frames to
> get through the constructor chain:
>
> @ 48 org.jruby.runtime.callsite.MinusCallSite::call (67 bytes)
> @ 11 org.jruby.Ruby::isFixnumReopened (5 bytes)
> @ 24 org.jruby.RubyFixnum::op_minus (38 bytes)
> @ 15 org.jruby.RubyFixnum::subtractionOverflowed (31 bytes)
> @ 24 org.jruby.RubyFixnum::subtractAsBignum never executed
> @ 29 org.jruby.runtime.ThreadContext::getRuntime (5 bytes)
> @ 34 org.jruby.RubyFixnum::newFixnum (29 bytes)
> @ 1 org.jruby.RubyFixnum::isInCacheRange (22 bytes)
> @ 25 org.jruby.RubyFixnum::<init> (14 bytes)
> @ 2 org.jruby.Ruby::getFixnum (5 bytes)
> @ 5 org.jruby.RubyInteger::<init> (6 bytes)
> @ 2 org.jruby.RubyNumeric::<init> (6 bytes)
> @ 2 org.jruby.RubyObject::<init> (6 bytes)
> @ 2 org.jruby.RubyBasicObject::<init> (17 bytes)
> @ 1 java.lang.Object::<init> inlining too deep
>
> This is in the inlined fib_ruby and could be the reason why reducing
> recursion inlining to 0 improves performance in some cases (but not
> fib?!)...i.e. the Fixnum creation in response to a "minus" operation
> is 8 frames, so there's only one frame to spare before we're over the
> default 9 call inlining limit. Since six of those frames are just the
> RubyFixnum constructor chain, I don't have a lot of wiggle room here.

Indeed. (BTW, note the email I just sent to hotspot-compiler-dev about
MaxRecursiveInlineLevel; it cheats on you.)

>
> Of course I'd love to see the max inline level bumped up...this isn't
> an absurdly deep hierarchy, but EA fails immediately in an inlined
> body.

But increasing the MaxInlineLevel to e.g.
15 (at which all calls are inlined) doesn't give me better performance
(the numbers are without the hack):

$ bin/jruby.sh --server -Xcompile.invokedynamic=true bench/bench_fib_recursive.rb 10 35
0.915000 0.000000 0.915000 ( 0.882000)
0.793000 0.000000 0.793000 ( 0.793000)
0.789000 0.000000 0.789000 ( 0.789000)
0.788000 0.000000 0.788000 ( 0.788000)
0.789000 0.000000 0.789000 ( 0.789000)
0.789000 0.000000 0.789000 ( 0.789000)
0.789000 0.000000 0.789000 ( 0.789000)
0.790000 0.000000 0.790000 ( 0.789000)
0.791000 0.000000 0.791000 ( 0.791000)
0.799000 0.000000 0.799000 ( 0.799000)

$ bin/jruby.sh --server -Xcompile.invokedynamic=true -J-XX:MaxInlineLevel=15 bench/bench_fib_recursive.rb 10 35
0.912000 0.000000 0.912000 ( 0.881000)
0.792000 0.000000 0.792000 ( 0.792000)
0.788000 0.000000 0.788000 ( 0.788000)
0.792000 0.000000 0.792000 ( 0.792000)
0.793000 0.000000 0.793000 ( 0.793000)
0.791000 0.000000 0.791000 ( 0.791000)
0.787000 0.000000 0.787000 ( 0.787000)
0.788000 0.000000 0.788000 ( 0.788000)
0.789000 0.000000 0.789000 ( 0.789000)
0.801000 0.000000 0.801000 ( 0.801000)

I think the current MaxInlineLevel is a good trade-off.

>
>
> Deja vu...have I asked this before? :)
>
> Then again I may be defeating EA already by using a Fixnum cache, but
> disabling that cache entirely impacts performance of small Fixnums
> significantly.
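
The Fixnum cache trade-off mentioned above can be illustrated with a minimal Java sketch, assuming an immutable boxed integer type with a preallocated table for small values (the same idea as Integer.valueOf). The class name, the -256..255 range and the fib helper are hypothetical and are not JRuby's implementation. The point it shows: a small result costs an array load instead of an allocation, but the returned reference may be either a shared cached object or a fresh one, which plausibly makes it harder for escape analysis to eliminate the allocation.

```java
// Hypothetical sketch of a boxed integer with a small-value cache.
// Names and cache bounds are illustrative assumptions, not JRuby's RubyFixnum.
public final class BoxedFixnum {
    static final long CACHE_MIN = -256;
    static final long CACHE_MAX = 255;
    private static final BoxedFixnum[] CACHE =
        new BoxedFixnum[(int) (CACHE_MAX - CACHE_MIN + 1)];

    static {
        // Preallocate one shared instance per cached value.
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = new BoxedFixnum(CACHE_MIN + i);
        }
    }

    public final long value;

    private BoxedFixnum(long value) {
        this.value = value;
    }

    static boolean isInCacheRange(long v) {
        return v >= CACHE_MIN && v <= CACHE_MAX;
    }

    /** Shared instance for small values; a fresh allocation otherwise. */
    public static BoxedFixnum valueOf(long v) {
        if (isInCacheRange(v)) {
            return CACHE[(int) (v - CACHE_MIN)]; // memory load, no allocation
        }
        return new BoxedFixnum(v); // removable by EA only if it never escapes
    }

    /** Naive recursive fib over boxed values (overflow checks omitted for brevity). */
    public static BoxedFixnum fib(BoxedFixnum n) {
        if (n.value < 2) {
            return n;
        }
        BoxedFixnum a = fib(valueOf(n.value - 1));
        BoxedFixnum b = fib(valueOf(n.value - 2));
        return valueOf(a.value + b.value);
    }

    public static void main(String[] args) {
        System.out.println(valueOf(5) == valueOf(5));       // true: same cached instance
        System.out.println(valueOf(1000) == valueOf(1000)); // false: two allocations
        System.out.println(fib(valueOf(20)).value);         // 6765
    }
}
```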
>
> FWIW, here's comparative performance of indy JRuby fib (without your
> call site check hack, obviously) versus a pure-Java version of fib
> that also uses RubyFixnum operations but virtual instead of dynamic
> dispatch:
>
> ~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=true
> -J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=3000
> bench/bench_fib_recursive.rb 5 35
> 9227465
> 1.002000 0.000000 1.002000 ( 0.938000)
> 9227465
> 0.788000 0.000000 0.788000 ( 0.787000)
> 9227465
> 0.796000 0.000000 0.796000 ( 0.796000)
> 9227465
> 0.785000 0.000000 0.785000 ( 0.785000)
> 9227465
> 0.785000 0.000000 0.785000 ( 0.785000)
>
> ~/projects/jruby ➔ java -cp lib/jruby.jar:build/classes/test
> org.jruby.test.bench.BenchFixnumFibRecursive
> Took 452ms for boxedFib(35) = 9227465
> Took 391ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
> Took 381ms for boxedFib(35) = 9227465
> Took 383ms for boxedFib(35) = 9227465
>
> So for this particular case, JRuby + indy is performing just over 2x
> slower than Java would.
>
> I've included (truncated) assembly output for 32-bit JVM optimizing
> the Java version here: https://gist.github.com/946382
>
> Obviously the dyncall guards are gone as are any JRuby runtime-related
> memory accesses, but I imagine there's also a higher potential for
> Fixnum objects to EA away. Naturally I'd love to get JRuby to perform
> as fast as Java, so I'll continue exploring ways to reduce or remove
> extra overhead in the JRuby version :)
>
> BTW, a note on JRuby test failures running indy... (i.e. ATTN REMI)
>
> I'm having some trouble with JRuby's compiler and ASM failing to emit
> valid stack maps. There are some compilation scenarios in JRuby that
> may be exposing a bug in ASM's stack map calculation. If I emit Java
> 1.5 compatible bytecode for those scenarios and let the map be
> calculated during verification, the code loads and executes fine.
> If I switch to 1.6 bytecode, I get verification errors saying that the
> stack map is invalid. Could be an ASM bug?
>
> With indy working really well now, I'm going to be working toward
> turning it on by default in JRuby, and that will require me to get
> test runs green. This is the main problem standing in my way.

That is great! I'd love to see everything PASS...

-- Christian

_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev