Subramanya Sastry wrote:
This note interweaves code optimization with discussion about IR because
the IR choice will be influenced by what you want to do with it.  I am
going to use examples below to guide this discussion.  A lot of it is
somewhat 'stream-of-thought' writing -- this is not a tight spec.

Thanks for jumping in, Subbu :)

Here 'call' could be a regular C call for all you care.  So, what would
happen in the compiler is that you would first transform the code into
a higher-level IR (with virtual calls as above), and at some point in the
optimization, you will ratchet the IR down one level, which might
expose other optimizations such as this one.  If you took care to maintain
the SSA-ness during transformations, optimizations such as eliminating
duplicate method lookups can be done by the SCCP (sparse conditional constant
propagation) algorithm, which is linear and pretty fast for all practical
purposes.

Yes, this fits what I was thinking as well. I believe part of our problem in optimizing Ruby is that we only ever represent things at the highest level, ignoring the fact that there are very important low-level operations (like function retrieval) that should enter into analysis. So in essence, we've been going straight from our high-level, coarse-grained IR (the AST) to the target language (JVM bytecode) with only primitive analysis in between. What we want is to decompose the high-level IR (perhaps after first making it a bit nicer than an AST) into the *actual* component operations, and to perform multiple levels of analysis before producing the eventual result. Yes, I see it.
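
As a strawman for that ratcheting step, here's a rough sketch of a per-instruction lowering (the classes and names are hypothetical, nothing in our tree today). Splitting an ocall into an explicit lookup plus a direct call is exactly what makes duplicate lookups visible to SCCP/CSE:

    import java.util.Arrays;
    import java.util.List;

    class Lowering {
        // High-level:  v = ocall(receiver, 'meth', args)
        // Low-level:   m = meth_addr(receiver, 'meth')
        //              v = call(m, receiver, args)
        static List<String> lowerOcall(String result, String receiver,
                                       String method, String args) {
            // Naming the temp by (receiver, method) makes the duplication
            // obvious here; a real lowering would emit fresh SSA temps and
            // let value numbering discover that the two lookups are equal.
            String m = "m_" + receiver + "_" + method;
            return Arrays.asList(
                m + " = meth_addr(" + receiver + ", '" + method + "')",
                result + " = call(" + m + ", " + receiver + ", " + args + ")");
        }

        public static void main(String[] args) {
            // Two ocalls against the same receiver and method lower to the
            // same meth_addr expression -- one of them can be eliminated.
            lowerOcall("v1", "o", "foo", "args1").forEach(System.out::println);
            lowerOcall("v2", "o", "foo", "args2").forEach(System.out::println);
        }
    }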

This also enters into what we talked about at the coffee shop: because MRI is green-threaded, we have some implicit boundaries at which we must propagate cross-thread changes to global state (like method tables). If we represent these operations at the IR level, we may be able to coarsen things like method-override checks until the next implicit thread-context boundary. Very nice.

Open classes & JIT => All optimizations have to be guarded
----------------------------------------------------------
...
But, if you are going to JIT Ruby code and enter JVM/HotSpot land, you have
to build in optimization guards throughout the code.  This means the IR has
to incorporate guard instructions. For example, here are a few types of guards:

Ahh...yes, this is exactly what I had in mind!

   type_guard(o, Fixnum, L)      <-- if type of o is not Fixnum, jump to L
   class_guard(Fixnum, L)        <-- if Fixnum has been monkey-patched, jump to L
   method_guard(Fixnum, '+', L)  <-- if Fixnum.+ has been monkey-patched, jump to L

What we have been doing currently, with our very limited optimizations, is just "crossing our fingers" and hoping things like Fixnum#+ don't get overridden. But I think if we're able to make these sorts of guards just another part of the IR, we'll suddenly have much better opportunities to do the right thing every time, without having to make assumptions. The guards will perform their checks, the results will propagate through the code, and when we hand off to HotSpot most of them will get folded away.
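
To make the shape concrete, a method_guard could lower to something like this on the JVM. The generation counter here is a hypothetical stand-in for however we end up tracking method-table changes; HotSpot can fold the check away wherever it can prove the counter stable:

    // Sketch only: 'fixnumGeneration' is a made-up mechanism, not JRuby's.
    class GuardedAdd {
        static volatile int fixnumGeneration = 0;  // bumped on monkey-patching
        static final int EXPECTED_GEN = 0;         // generation at compile time

        interface Fallback { long call(long a, long b); }

        static long add(long a, long b, Fallback slow) {
            if (fixnumGeneration != EXPECTED_GEN) {
                return slow.call(a, b);  // guard failed: full dynamic dispatch
            }
            return a + b;                // fast path: primitive add
        }
    }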

I'm also proceeding with this discussion from a perspective that ignores the JVM's limitations on bytecode size. Adding more logic to a method, such as fallback branches for when guards fail, will often bite us because of HotSpot's code-size and graph-complexity limits. But I think the opportunity here is to produce a flexible enough compiler chain that we can move that optimization bar up and down easily, taking advantage of newer JVMs as they come out or allowing more optimizations if additional HotSpot flags are provided.

The decomposition you're describing will definitely give us more flexibility.

Clearly, there is a lot of overhead unboxing the objects -- but that is because
the objects are being passed in as arguments.  But, you can see that if one or
both of this expression's operands were literals, the overhead drops.  If both
were literals, the code reduces to boxing the result into a Fixnum.  But, if
not, HotSpot might be able to inline the 'value' calls at this site and get rid
of the unboxing overhead.

Yes, especially in Java 7 where the available escape analysis is much more robust. If we are forced to allocate boxed values, we need to at least make *damn* sure we can inline the code that produces or consumes those boxed values, so we have a chance for them to compile and optimize as part of the same compilation unit.

The experimental "dyncall inlining" flag in 1.3 may already be enabling this, since I can see multiple levels of "fib" inline together into a single unit of asm. If escape analysis is working like we expect, we should see the boxing cost disappear for several levels of calls, only returning when we have to actually CALL into a new body or RET back to a previous one.

JRuby Inlining
--------------
Since HotSpot won't really be able to optimize away the unboxing and add
type-based guards as above, at some point you might have to embrace method
inlining.  For inlining, you will use the high-level IR.
...
Clearly, this is still simplistic because you might be boxing the 8 into a Fixnum prematurely, and it might never be needed. So, in order to be able to optimize boxing and unboxing, it might be useful to introduce IR instructions that encapsulate boxing and unboxing of primitive types. Since the target is the JVM, you will introduce boxing/unboxing IR instructions for JVM primitives. Fixnum and Array
are obvious candidates.

So, box and unbox IR instructions could be:
   n = box_primitive(Fixnum, v)
   v = unbox_primitive(Fixnum, n)

If I remember right, I've seen similar operations in MacRuby's calls into LLVM, and I presume they get the same optimization from them that we want.
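
For us, box_primitive and unbox_primitive for Fixnum would lower to calls we already make all over the generated code -- roughly this, if I have the RubyFixnum names right:

    import org.jruby.Ruby;
    import org.jruby.RubyFixnum;

    class Boxing {
        // n = box_primitive(Fixnum, v)
        static RubyFixnum box(Ruby runtime, long v) {
            return RubyFixnum.newFixnum(runtime, v);
        }

        // v = unbox_primitive(Fixnum, n)
        static long unbox(RubyFixnum n) {
            return n.getLongValue();
        }
    }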

With this, the inlined and optimized code above would become:
   ...
   method_guard(Fixnum, '+', L1)
   n = box_primitive(Fixnum, 8)
   jump L2
L1:
   args = [5]
   n = ocall 3, '+', args
L2:
   ...

Again, we have two perspectives:

If we assume no limitations on the code size and complexity we can generate (or if we assume we will always be able to work around those limitations) then emitting inlined logic with appropriate guards will be fairly easy. As you show here, we'll perform our method and type guards, with the fast path immediately following successful tests and a branch to a slow path following failed tests.

If we are pessimistic and assume that code size limitations are going to bite us, then we need to be fairly selective about what we inline. Math operations are obvious, as are other core class operations and frequently-used closure-receiving methods like Array#each. Again, having a flexible compiler chain will allow us to slide this bar up and down.

Reality will surely lie somewhere in between, but I think we should keep both extremes in mind while realizing there's more "future" than there is "present" waiting for the fruits of our labors :)
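
One way to keep that flexibility honest is to make the bar an explicit, tunable policy object rather than a hardcoded choice. A toy sketch -- every name and threshold here is invented:

    class InlinePolicy {
        int maxInlineSize = 32;  // slide up/down per target JVM or HotSpot flags

        boolean shouldInline(String klass, String method, int calleeSize) {
            if (isCoreMath(klass, method)) return true;           // Fixnum#+ etc.
            if (isHotClosureReceiver(klass, method)) return true; // Array#each etc.
            return calleeSize <= maxInlineSize;                   // everything else
        }

        private boolean isCoreMath(String k, String m) {
            return k.equals("Fixnum")
                && (m.equals("+") || m.equals("-") || m.equals("*") || m.equals("<"));
        }

        private boolean isHotClosureReceiver(String k, String m) {
            return k.equals("Array") && m.equals("each");
        }
    }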

The next problem is to propagate type information throughout a method so that optimizations can be as global as possible. In the optimization analyses, you have the option of carrying the state around implicitly (in hashtables while running the analyses) or explicitly (by introducing attribute instructions in the IR). The implicit approach is easy and preferable where the information about a variable is not use-site specific (constant propagation, for example). But where that is not true, the explicit approach is preferable: for example, the type of an object can differ along the two paths of a branch, so a variable 'v' (in SSA form) might have type 'Fixnum' along one path and 'String' along another.

I prefer the explicit approach:
- it lends itself to easier debugging, because when you dump the IR you can
  see what information is being generated.
- use-site-specific information becomes attributes that are no longer
  use-site specific (because every branched path gets its own SSA attribute).
- by converting use-site-specific information to constants, you can incorporate
  the transformations into regular constant propagation via SCCP.

I used this approach to good effect in the JVM compiler I wrote for my
dissertation, where it eliminated null checks and array bounds checks.

Yes, it seems to me the explicit approach would provide a much wider set of opportunities for us. A key example is turning this:

5.times {
  # stuff
}

... or this ...

a = 0
while a < 5
  # stuff
  a += 1
end

...into a primitive loop. That's only possible if we can tell that at each of these sites the counter is always provably a Fixnum (assuming Fixnum's operations have not been violated), and then produce a call-free, boxing-free loop. This is where we get into the real magic that a good IR and tool chain can give us; with solid type propagation, these kinds of transformations just fall out.
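
The end state for either form, in Java terms, is nothing more than this:

    class ReducedLoop {
        static void run() {
            // no dynamic calls, no boxing, just a primitive counter
            for (long a = 0; a < 5; a++) {
                // stuff
            }
        }
    }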

So, with this in mind, you will now augment all IR instructions to include an
array of attributes.  So, the generic IR form will now be:

    v = OP(operands, attributes)

When you run SCCP on this, you also run SCCP on the attribute values. This lets you use attributes as additional meta-information to optimize that instruction.

For example, consider this sequence of IR instructions:
   v = 5
   a = box_primitive(Fixnum, v, [])
   a_attr = [BOXED_FIXNUM(v)]
   ...
   ... a is not used anywhere here ...
   ...
   b = a
   b_attr = a_attr
   x = unbox_primitive(Fixnum, b, [b_attr])
   y = x

When you run SCCP on this, here is what happens to the unbox_primitive instruction:
   x = unbox_primitive(Fixnum, b, [[BOXED_FIXNUM(v)]])

Thus, you know that you are unboxing a boxed value 'v'. So, you can simplify this
instruction to:
   x = v

So, when you run SCCP, the above sequence reduces to (I am excluding some useless copies here)
   v = 5
   a = box_primitive(Fixnum, v, [])
   ...
   y = 5

After you run dead code elimination, assuming a is not used anywhere, the code simplifies to
   v = 5
   y = 5
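
This maps pretty directly onto a data structure. Here's a rough sketch of the generic v = OP(operands, attributes) form plus the unbox simplification rule you describe -- all naming is hypothetical, none of this exists in JRuby today:

    class Instr {
        final String result;      // the SSA variable being defined (v), or null
        final String op;          // "ocall", "box_primitive", "copy", ...
        final Object[] operands;  // array of regular operands
        final Object[] attrs;     // array of attributes -- SCCP folds these too

        Instr(String result, String op, Object[] operands, Object[] attrs) {
            this.result = result;
            this.op = op;
            this.operands = operands;
            this.attrs = attrs;
        }
    }

    class BoxedFixnum {
        final String sourceVar;   // the v in BOXED_FIXNUM(v)
        BoxedFixnum(String v) { sourceVar = v; }
    }

    class UnboxRule {
        // x = unbox_primitive(Fixnum, b, [BOXED_FIXNUM(v)])  ==>  x = v
        static Instr simplify(Instr unbox) {
            Object attr = unbox.attrs.length > 0 ? unbox.attrs[0] : null;
            if (attr instanceof BoxedFixnum) {
                String v = ((BoxedFixnum) attr).sourceVar;
                return new Instr(unbox.result, "copy",
                                 new Object[] { v }, new Object[0]);
            }
            return unbox;  // attribute unknown; leave the unbox in place
        }
    }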

Here's an attempt to show this using my loop example from above:

Basic pseudo-IR

v = 0
a = box_primitive(Fixnum, v, [])
b = ocall a, '<', [box_primitive(Fixnum, 5, [])]
if_false [b, :end_label]
top_label: # stuff
a' = ocall a, '+', [box_primitive(Fixnum, 1, [])]
b' = ocall a', '<', [box_primitive(Fixnum, 5, [])]
if_true [b', :top_label]
end_label: .....

After propagating types

v = 0
a(Fixnum) = box_primitive(Fixnum, v, [])
b(Boolean) = ocall_typed Fixnum, a, '<', [box_primitive(Fixnum, 5, [])]
if_false [b, :end_label]
top_label: # stuff
a'(Fixnum) = ocall_typed Fixnum, a, '+', [box_primitive(Fixnum, 1, [])]
b'(Boolean) = ocall_typed Fixnum, a', '<', [box_primitive(Fixnum, 5, [])]
if_true [b', :top_label]
end_label: .....

And reducing/inlining based on those propagated types:

a = 0
if_ge a, 5, :end_label
top_label: # stuff
a' = add a, 1
if_lt a', 5, :top_label
end_label:

As long as Fixnum's *operations* remain as expected, this is 100% safe and legal.

Summary:
--------

This is a tremendous start! Thanks so much for jumping in with this. I think we're on the right track.

And this list is a perfect place for us to start prototyping a few simple localized examples...

1. Every IR instruction is basically of the form v = OP(operand_array, attribute_array)

   Since this is a high-level IR, you are not bound to a 3-operand format.
   You can have an array of operands and an array of attributes.
   But, you can also have IR instructions of the form:
      v = OP(op1, attribute_array)
      v = OP(op1, op2, attribute_array)

2. You have regular ALU-style IR instructions (add, sub, cmp, lshift, etc.)

3. You have various flavors of call instructions:
     v = call(method, args)
     v = ocall(receiver, method, args)
     ... and anything else you can think of ...

4. You have a method-address instruction:
     m = meth_addr(receiver, method)

5. Instructions to extract function/method arguments and return function values:
     x = meth_arg(arg_index)
     return y

6. You have various kinds of guard instructions:
     type_guard(obj, class, L)       <-- if type of obj is not class, jump to L
     class_guard(class, L)           <-- if class has been monkey-patched, jump to L
     method_guard(class, method, L)  <-- if class.method has been monkey-patched, jump to L

7. You have box/unbox instructions for primitives:
     n = box_primitive(class, v)
     v = unbox_primitive(class, n)

8. You have attribute instructions for passing around meta-information (types, etc.)
     init_attr(a_attr, [])
     add_attr(a_attr, X)

   Note:
   - a_attr is just a naming convention.  As far as representation is
     concerned, it is just another SSA IR variable.
   - You need to identify the different types of meta-information you might
     pass around.  The X above is a placeholder for that.  Some kinds of X:
       - BOXED_FIXNUM(n)
       - BOXED_ARRAY(a)
       - HAS_TYPE(class)
       ...

This is it for the 1st pass.  More later with an example of closures.
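
And to tie the catalog above to something we could prototype right away: the guarded fast path from the inlining example falls out of the same Instr sketch from earlier (names still hypothetical):

    import java.util.Arrays;
    import java.util.List;

    class Catalog {
        static List<Instr> guardedAdd() {
            return Arrays.asList(
                new Instr(null, "method_guard",
                          new Object[] { "Fixnum", "+", "L1" }, new Object[0]),
                new Instr("n", "box_primitive",
                          new Object[] { "Fixnum", 8 }, new Object[0]),
                new Instr(null, "jump",
                          new Object[] { "L2" }, new Object[0]));
        }
    }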

