On Thu, Apr 28, 2011 at 8:19 AM, Charles Oliver Nutter
<head...@headius.com> wrote:
> I've been trying to think of ways to reduce the guard cost, since the
> perf without the JRuby guard is a fair bit better (0.79 versus 0.63s
> for fib(35)). The performance without guards is actually faster than
> any other Ruby implementation I've yet run. One idea:

Now for a harder question...

Any thoughts on how we can make this even faster?

The bulk of the code seems to be taken up by a few operations inherent
to Fixnum math:

* Memory accesses relating to CallSite subclasses (LtCallSite and friends)
* instanceof checks in those math-related CallSites
* Fixnum overflow checks in + and - operations
* Fixnum allocation/initialization costs (or Fixnum cache accesses)

As it stands today, the overhead of Fixnum operations is the primary
factor preventing us from writing a lot more of JRuby's code in Ruby.
Fixnums are too expensive to use for iterating over an array, doing a
loop, etc. Of course we could do some code analysis to try to reduce
loops to simple int operations, but barring that...does anyone have
suggestions for reducing the cost of actual Fixnum operations?

Also...is escape analysis (EA) working with indy now? Unfortunately
Fixnum construction does not fully inline at the moment, since there are
too many frames to get through the constructor chain:

  @ 48 org.jruby.runtime.callsite.MinusCallSite::call (67 bytes)
    @ 11 org.jruby.Ruby::isFixnumReopened (5 bytes)
    @ 24 org.jruby.RubyFixnum::op_minus (38 bytes)
      @ 15 org.jruby.RubyFixnum::subtractionOverflowed (31 bytes)
      @ 24 org.jruby.RubyFixnum::subtractAsBignum never executed
      @ 29 org.jruby.runtime.ThreadContext::getRuntime (5 bytes)
      @ 34 org.jruby.RubyFixnum::newFixnum (29 bytes)
        @ 1 org.jruby.RubyFixnum::isInCacheRange (22 bytes)
        @ 25 org.jruby.RubyFixnum::<init> (14 bytes)
          @ 2 org.jruby.Ruby::getFixnum (5 bytes)
          @ 5 org.jruby.RubyInteger::<init> (6 bytes)
            @ 2 org.jruby.RubyNumeric::<init> (6 bytes)
              @ 2 org.jruby.RubyObject::<init> (6 bytes)
                @ 2 org.jruby.RubyBasicObject::<init> (17 bytes)
                  @ 1 java.lang.Object::<init> inlining too deep

This is in the inlined fib_ruby and could be the reason why reducing
recursion inlining to 0 improves performance in some cases (but not
fib?!)...i.e.
the Fixnum creation in response to a "minus" operation is 8 frames, so
there's only one frame to spare before we're over the default 9 call
inlining limit. Since six of those frames are just the RubyFixnum
constructor chain, I don't have a lot of wiggle room here.

Of course I'd love to see the max inline level bumped up...this isn't an
absurdly deep hierarchy, but EA fails immediately in an inlined body.
Deja vu...have I asked this before? :)

Then again I may be defeating EA already by using a Fixnum cache, but
disabling that cache entirely impacts performance of small Fixnums
significantly.

FWIW, here's comparative performance of indy JRuby fib (without your
call site check hack, obviously) versus a pure-Java version of fib that
also uses RubyFixnum operations but virtual instead of dynamic dispatch:

~/projects/jruby ➔ jruby --server -Xcompile.invokedynamic=true -J-XX:MaxInlineSize=150 -J-XX:InlineSmallCode=3000 bench/bench_fib_recursive.rb 5 35
9227465 1.002000 0.000000 1.002000 ( 0.938000)
9227465 0.788000 0.000000 0.788000 ( 0.787000)
9227465 0.796000 0.000000 0.796000 ( 0.796000)
9227465 0.785000 0.000000 0.785000 ( 0.785000)
9227465 0.785000 0.000000 0.785000 ( 0.785000)

~/projects/jruby ➔ java -cp lib/jruby.jar:build/classes/test org.jruby.test.bench.BenchFixnumFibRecursive
Took 452ms for boxedFib(35) = 9227465
Took 391ms for boxedFib(35) = 9227465
Took 383ms for boxedFib(35) = 9227465
Took 381ms for boxedFib(35) = 9227465
Took 383ms for boxedFib(35) = 9227465

So for this particular case, JRuby + indy is performing just over 2x
slower than Java would. I've included (truncated) assembly output for
32-bit JVM optimizing the Java version here:

https://gist.github.com/946382

Obviously the dyncall guards are gone as are any JRuby runtime-related
memory accesses, but I imagine there's also a higher potential for
Fixnum objects to EA away.
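
For readers who want to see the shape of the costs listed above, here is
a minimal sketch of a boxed fib with virtual dispatch. It is not the
actual BenchFixnumFibRecursive source and does not use RubyFixnum: the
Fix class, its -128..127 cache range, and the method names are all
assumptions made for illustration. It shows the three per-operation
costs in plain Java: the cache lookup, the allocation on a cache miss,
and the overflow check on + and -.

```java
public class BoxedFib {
    // Hypothetical stand-in for a boxed integer like RubyFixnum (NOT JRuby's class).
    static final class Fix {
        private static final int CACHE_OFFSET = 128;      // assumed cache range: -128..127
        private static final Fix[] CACHE = new Fix[2 * CACHE_OFFSET];
        static {
            for (int i = 0; i < CACHE.length; i++) {
                CACHE[i] = new Fix(i - CACHE_OFFSET);
            }
        }

        final long value;

        private Fix(long value) {
            this.value = value;
        }

        // Cache lookup for small values, allocation for everything else.
        static Fix of(long v) {
            if (v >= -CACHE_OFFSET && v < CACHE_OFFSET) {
                return CACHE[(int) v + CACHE_OFFSET];
            }
            return new Fix(v);
        }

        Fix plus(Fix other) {
            long r = value + other.value;
            // Overflow iff both operands have the same sign and the result's sign differs.
            if (((value ^ r) & (other.value ^ r)) < 0) {
                throw new ArithmeticException("overflow (a Ruby runtime would promote to Bignum)");
            }
            return of(r);
        }

        Fix minus(long other) {
            long r = value - other;
            // Overflow iff the operands' signs differ and the result's sign differs from the left operand.
            if (((value ^ other) & (value ^ r)) < 0) {
                throw new ArithmeticException("overflow (a Ruby runtime would promote to Bignum)");
            }
            return of(r);
        }

        boolean lessThan(long other) {
            return value < other;
        }
    }

    // Recursive fib over boxed values; every + and - goes through Fix.
    static Fix fib(Fix n) {
        if (n.lessThan(2)) {
            return n;
        }
        return fib(n.minus(1)).plus(fib(n.minus(2)));
    }

    public static void main(String[] args) {
        System.out.println(fib(Fix.of(20)).value); // prints 6765
    }
}
```

In a sketch like this the constructor chain is only two frames deep
(Fix and Object), so it says nothing about the inlining-depth problem
itself; it only illustrates the per-operation work that remains once
dispatch is out of the picture.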

Naturally I'd love to get JRuby to perform as fast as Java, so I'll
continue exploring ways to reduce or remove extra overhead in the JRuby
version :)

BTW, a note on JRuby test failures running indy... (i.e. ATTN REMI)

I'm having some trouble with JRuby's compiler and ASM failing to emit
valid stack maps. There are some compilation scenarios in JRuby that may
be exposing a bug in ASM's stack map calculation. If I emit Java 1.5
compatible bytecode for those scenarios and let the map be calculated
during verification, the code loads and executes fine. If I switch to
1.6 bytecode, I get verification errors saying that the stack map is
invalid. Could be an ASM bug?

With indy working really well now, I'm going to be working toward
turning it on by default in JRuby, and that will require me to get test
runs green. This is the main problem standing in my way.

- Charlie
_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev