>> Here 'call' could be a regular C call for all you care.  So, what would
>> happen in the compiler is that you would first transform the code into
>> a higher-level IR (with virtual calls as above), and at some point in the
>> optimization, you will ratchet the IR down one level lower, which might
>> expose other optimizations such as this one.  If you took care to maintain
>> the SSA-ness during transformations, optimizations such as eliminating
>> duplicate method lookups can be done by the SCCP (sparse conditional
>> constant propagation) algorithm, which is linear and pretty fast for all
>> practical purposes.
>>
>
> Yes, this fits what I was thinking as well. I believe part of our problem
> in optimizing Ruby is that we only ever represent things at the highest
> levels, ignoring the fact that there are very important low-level operations
> (like function retrieval) that should enter into analysis. So in essence,
> we've been going straight from our high-level, coarse-grained IR (the AST)
> to the target language (JVM bytecode) with only primitive analysis in
> between. What we want is to decompose the high-level IR (perhaps after first
> making it a bit nicer than an AST) into the *actual* component operations,
> and to perform multiple levels of analysis before producing the eventual
> result. Yes, I see it.
>
> This also enters into what we talked about at the coffee shop, where
> because MRI is green threaded we have some implicit boundaries at which we
> must propagate cross-thread changes to global state (like method tables).
> And that if we represent these operations at an IR level, we may be able to
> coarsen things like method override checks until the next implicit
> thread-context boundary. Very nice.
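
(An aside to make "function retrieval" concrete: Ruby itself exposes the
lookup as a first-class operation via Method objects.  Note that a captured
Method does not see later redefinitions -- which is exactly why hoisting
lookups out of a region changes observable behavior.)

  i = 5
  plus = i.method(:+)   # the lookup itself, reified as a separate operation
  v1 = plus.call(1)     # invoke without repeating the lookup; v1 == 6
  v2 = plus.call(2)     # reuses the same looked-up method; v2 == 7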


I may have been wrong.   I had a chance to think through this a little bit
more.

Consider this ruby code:

i = 5
v1 = i + 1
some_random_method_call()
v2 = i + 1

In this code snippet, the second '+' might not get optimized, because
'some_random_method_call' could monkeypatch Fixnum.+ in the course of
whatever it does.  This means that calls form hard walls beyond which you
cannot hoist method and class modification guards (unless, of course, you can
precisely determine the set of methods any given call can modify -- for
example, determine that 'some_random_method_call' will not modify Fixnum.+).
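
Here is a tiny runnable version of this (using a made-up class Foo rather
than Fixnum itself, since actually monkeypatching Fixnum.+ would wreak havoc
well beyond the example):

  class Foo
    def initialize(v); @v = v; end
    def +(other); @v + other; end
  end

  def some_random_method_call
    # monkeypatches Foo.+ behind the compiler's back
    Foo.class_eval do
      def +(other); (@v + other) * 100; end
    end
  end

  i = Foo.new(5)
  v1 = i + 1                 # 6, via the original Foo.+
  some_random_method_call()
  v2 = i + 1                 # 600, via the redefined Foo.+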

So, this is somewhat similar to pointer alias analysis in C programs.  In C,
the ability to pass pointers around and manipulate them in an unrestricted
fashion effectively restricts the kinds of optimizations a compiler can do.
Similarly, in Ruby, it seems to me that the ability for code to arbitrarily
modify code restricts the kinds of optimizations a compiler can do.  Open
classes and eval and all of that are to Ruby what pointers are to C:
potentially powerful for the programmer, but hell for the compiler.  More on
this below, after a discussion of the multi-threaded scenario.

Having thought through multi-threaded program scenarios a little bit, it
seems to me that, for the purposes of optimization, you can assume a
single-threaded program and optimize it.

Without this assumption, thread-switch points (in addition to calls, as
above) effectively restrict how far you can hoist guards (and hence remove
them).  To motivate this, consider this ruby code running in thread 1 (for
the purpose of this example, assume this is a for-loop, not a closure).

   10000.times { |i| puts "+1 is #{i+1}" }

If thread 2 modifies Fixnum.+, you might expect the modification to be
reflected in thread 1 at some point.  If you hoist method/class guards all
the way outside the loop and convert the i+1 to integer addition, code
modifications from thread 2 won't propagate to thread 1.

But, in this scenario, I would argue that any code that relies on behavior
like this is broken.  This is effectively a race condition, and different
ruby implementations will yield different behavior.  In fact, code
modification by meta-programming is effectively modifying class meta-data.
So, concurrent code modification is not very different from concurrent
writes to program data.  I am not sure if the meta-programming model treats
modifications to open classes as changes to program-visible meta-data.  If
it did, the programmer could then synchronize on that meta-data object
before patching code.  But I expect this is not the case, because that would
then force every method call to acquire a lock on the class meta-data.
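
A small script makes the race concrete (Counter and the timings are made up;
where the flip happens varies from run to run and from implementation to
implementation):

  class Counter
    def bump(x); x + 1; end
  end

  c = Counter.new
  patcher = Thread.new do
    sleep 0.01
    # a concurrent "write" to Counter's meta-data
    Counter.class_eval { def bump(x); x + 100; end }
  end

  results = []
  20.times { results << c.bump(0); sleep 0.002 }
  patcher.join
  p results   # a run of 1s followed by 100s; the flip point is timing-dependent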

Where does this leave us?  Consider a Rails app where there is an
UpdateCodeController with a method called load_new_code(C, m) which
basically updates running code with a new version of a method 'm' in class
'C' (let's say, a bug fix).  This discussion of code modification then boils
down to asking the question: at what point does the new method become
available?  In the absence of code-modification synchronization (on C's
meta-data object), the ruby implementation can block on this request while
continuing to run the old version of 'm' until it is convenient to switch
over to the new code!

This is obviously a contrived example, but the point of this is: if there is
no mechanism for the programmer to force code modifications to be visible,
a ruby implementation has flexibility over how long it turns a blind eye to
code modifications made in a thread other than the currently executing one.
So, what you need to worry about is figuring out the hard boundaries for
method and class guards assuming a single-threaded program, and optimizing
for that scenario.

Am I missing something?

Now, back to the single-threaded scenario and the earlier example:

i = 5
v1 = i + 1
some_random_method_call()
v2 = i + 1

As discussed earlier, the problem here is determining the set of methods
that a call can modify, and this is somewhat similar to the C pointer
aliasing problem.  That by itself is "not a big deal" because, unlike C
pointers, code modification is much rarer.  So, in the ruby implementation,
you could optimistically / speculatively optimize a ton of code assuming NO
code modifications, and move the burden of invalidation from call sites
(which are everywhere) to the centralized class meta-data structures, i.e.,
you install listeners/traps/callbacks (whatever you want to call them) on
the class, and when a method is modified, you invalidate all optimized code
that assumes that the method is "static".  This is non-trivial (it will
require on-stack replacement of code), but at least conceivable.  But, since
you are compiling for the JVM (or LLVM in the case of MacRuby), once you
JIT, you effectively cede control.  You don't have a mechanism to invalidate
JIT-ted code.  So, at this time, I am not sure if you can really get around
this (extremely pessimistic) constraint.  Only ruby implementations that
control the compilation stack all the way down to machine code (or have
hooks to invalidate / modify existing code) might be able to get around it.
I am curious how the MacRuby-LLVM combination is tackling this ...
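
Here is a sketch, in ruby terms, of the listener half of this idea using the
built-in method_added hook; OptimizedCode, register, and deoptimize! are
made-up names standing in for the real invalidation machinery:

  module Invalidator
    def method_added(name)
      # a method was (re)defined: invalidate any optimized code that
      # assumed this method was "static"
      OptimizedCode.invalidate(self, name)
      super
    end
  end

  class OptimizedCode
    @registry = {}
    def self.register(klass, name, code)
      (@registry[[klass, name]] ||= []) << code
    end
    def self.invalidate(klass, name)
      (@registry.delete([klass, name]) || []).each { |code| code.deoptimize! }
    end
  end

  class Fixnum            # any class whose methods were compiled as "static"
    extend Invalidator
  end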

This is not a show-stopper.  This just limits how much you can optimize.

Am I missing something?



> JRuby Inlining
> --------------
> Since HotSpot won't really be able to optimize the unboxing and add
> type-based guards as above, at some point, you might have to embrace
> method inlining.  For inlining, you will use the high-level IR.
>
...
>
>> Clearly, this is still simplistic because you might be boxing the 8
>> into a Fixnum prematurely, and it might never be needed.  So, in order
>> to be able to optimize boxing and unboxing, it might be useful to
>> introduce IR instructions that encapsulate boxing and unboxing of
>> primitive types.  Since the target is the JVM, you will introduce
>> boxing/unboxing IR instructions for JVM primitives.  Fixnum and Array
>> are obvious candidates.
>>
>> So, box and unbox IR instructions could be:
>>   n = box_primitive(Fixnum, v)
>>   v = unbox_primitive(Fixnum, n)
>>
>
> If I remember right, I've seen similar operations in MacRuby's calls to
> LLVM, and I presume they get the same optimization from it that we want.
>
>  With this, the inlined and optimized code above would become:
>>   ...
>>   method_guard(Fixnum, '+', L1)
>>   n = box_primitive(Fixnum, 8)
>>   jump L2
>> L1:
>>   args = [5]
>>   n = ocall 3, '+', args
>> L2:
>>   ...
>>
>
> Again, we have two perspectives:
>
> If we assume no limitations on the code size and complexity we can generate
> (or if we assume we will always be able to work around those limitations),
> then emitting inlined logic with appropriate guards will be fairly easy. As
> you show here, we'll perform our method and type guards, with the fast path
> immediately following successful tests and a branch to a slow path following
> failed tests.
>
> If we are pessimistic and assume that code size limitations are going to
> bite us, then we need to be fairly selective about what we inline. Math
> operations are obvious, as are other core class operations and
> frequently-used closure-receiving methods like Array#each. Again, having a
> flexible compiler chain will allow us to slide this bar up and down.



Makes sense!
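
To spell out, in plain ruby, what the guarded fast path above means
semantically (just a sketch; FIXNUM_PLUS_UNMODIFIED stands in for the real
method_guard, and the real thing would of course be emitted as JVM bytecode):

  receiver = 3
  FIXNUM_PLUS_UNMODIFIED = true      # stand-in for the method_guard check

  n = if receiver.instance_of?(Fixnum) && FIXNUM_PLUS_UNMODIFIED
        8                            # fast path: '3 + 5' constant-folded
      else
        receiver.send(:+, 5)         # slow path: full dynamic dispatch
      end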

I thought I would start working on an example of closures, but got sucked
into the code modification problem.  Clearly, having a rock-solid code
invalidation mechanism is crucial to good optimizations.  What you can
invalidate, and when you can invalidate it, will effectively dictate what
and how much you can optimize.  Perhaps this should be a separate thread.

Subbu.
