Here is one possibility for optimistic optimizations of ruby code in the
presence of open classes without on-stack replacement (OSR).
To see this, it is useful to think of all ruby classes carrying with them a
version number. [ Technically, this can be done at a more fine-grained
level (i.e. at the level of methods), but, let us start with classes for
now. ] So, all classes start with version number 0. Whenever you modify a
class in some fashion (add a method, monkey patch a method, etc.), you
increment the version number by 1.
When you compile a piece of ruby code, if you compile without assuming
specific version numbers of a class, you dont get into trouble. But, where
you do make assumptions to optimize code, the compiled code is effectively
valid for the snapshot of the various version numbers of classes at the time
of compilation.
Since we are dealing with versioning of classes and methods, it is only at
call sites that you need to worry about code versions. There are 3
assumptions you can make at a call site.
Let us consider a call site S: o.m(args)
(1) The specific class of the receiver (that o is an instance of class C,
ex: Fixnum)
(2) The specific method that is being dispatched to (that m is the specific
method M within C, ex: +)
(3) The specific code of the method that you are dispatching to (that M is
implemented by a specific set of instructions I1; I2; ... I_n; ex: + is
implemented by the java bytecode iadd)
You need to worry about code versions only for those call sites where you
know enough information about the call site that you can make assumptions
about the class, method, and method code AND also inline the method code at
the call site.
To see this, consider the call site "3.+(x)" where you don't know the type
of 'x'
Here you know that 3 is a fixnum, and the '+' is the addition method. But,
without knowing the type of x, you cannot really inline the '+' method of
Fixnum and convert this into an iadd. Therefore, even if the '+' method is
overridden to something else, since the '+' method slot in the method table
of Fixnum will remain the same, the code at this call site really doesn't
change.
[ Note that implementing inline caching and adding type guards are one way
of narrowing down the dynamic information at the call site to specifics.
So, at call sites that get unwrapped by ICs and PICs, where you decide to
inline target calls, you are making an assumption about the code version of
the class of the receiver. This is just a special case of what I am writing
here and doesn't change the discussion. ]
So far so good. Let's go back to our generic call site S: v = o.m(args)
Let us say that I inline code at S assuming that the receiver o is an
instance of C and also that C is at version V at this time. [ As stated
earlier, I can make this more finegrained by making an assumption that
method M is at version V rather than class C, but that is a refinement. ]
So, all I need to do to ensure correctness is to add a version assumption
information before the call site S before inlining, as follows:
ASSERT_VERSION(C, V, L1) <-- if C is not at version V, branch to L1
where you have code that dispatches to C.M without inlining the code
.... inlined body of method M ....
jump L2
L1: v =o.m(args)
L2:
At this level, this code will ensure correctness of inlining, even if C is
Fixnum and M is +. Note that so far, this does not require any OSR. We
will get to how to implement versioning cheaply in a little bit. But, for
now note that while this ensures correctness in the general case, this is
not good enough for Fixnum. To see this, consider this code:
a = 3+5
b = a+1
Assume that Fixnum is at version 0. This will translate to:
ASSERT_VERSION(Fixnum, 0, L1)
a = 8
jump L2
L1: a = call(Fixnum.+, [3,5])
L2: b = call(lookup_method(a, +), [a, 1])
Just after one inlining, the possibility of optimization stops. because at
L2, you don't know anymore what the type of 'a' is. Imagine some moron
implementing Fixnum.+ to return "hello world" always.
Going back to our generic case, an alternative implementation of the
inlining optimization at S with an assert version before it is as follows:
ASSERT_VERSION(C, V, L1) <-- if C is not at version V, branch to L1
where you have code that dispatches to C.M without inlining the code
.... inlined body of method M ....
.... rest of the code for the current method ...
return <something>
L1: v = o.m(args)
... rest of the code for the current method...
return <something>
So, this alternative implementation treats a version failure as an exception
and branches to an entirely new copy of the rest of the method. If
implemented naively, you can run into trouble with lots of exception cases
for the method. We'll get to that in a bit. For now, let us go back to the
previous example of two back to back additions:
a = 3+5
b = a+1
So, you would compile this as:
ASSERT_VERSION(Fixnum, 0, L1)
a = 8
ASSERT_VERSION(Fixnum, 0, L2)
b = 9
...
L1: a = call(Fixnum.+, [3,5])
.. and so on ..
L2: b = call(Fixnum.+,[8, 1])
... and so on ..
But, there is one clear optimization here which is that the second
ASSERT_VERSION(Fixnum, 0, L2) is useless because there is no possibility of
Fixnum being modified between a = 8 and b = 9. So, we can get rid of the
second assert version and its exception code altogether. If you take this
line of reasoning further, you will notice that the in a straightline piece
of code, ASSERT_VERSION instructions will only be present between
non-inlined calls (to whatever method it might be). Additionally, you can
only get rid of them as far as basic block boundaries ..
Anyway, all these are optimizations of placement of ASSERT_VERSION
instructions. This can be explore in a separate thread. I have not thought
through it completely. But, it doesn't seem to be such a big deal. We have
reduced the problem of optimizing in the presence of open classes to one of
optimizing placement of ASSERT_VERSION instructions.
So far, I've not addressed how to implement the ASSERT_VERSION instruction.
The straightforward way to do this is clearly to have a 'version' field in
the platform implementation of the Class ruby class. So, effectively
ASSERT_VERSION(C, V, L) will reduce to:
v = get_field(C, version)
bne(v, V, L)
When compiled to machine code, this will get compiled to something like
this:
%r1 = LOAD C, field_offset_version
BNE(%r1, V, L)
This does seem like an expensive pair of instructions. A load and a
branch. But, with one additional optimization, I'll argue how these
effectively become NOPs in modern superscalar processors.
Since there is no real ruby field called version in Class, there is no
requirement that version be stored in the class. You can do all kinds of
optimizations. You can store them in a global array, a global hash. Or
better yet, you can store them all in a large pool of global variables
called Fixnum_Version, Float_Version, Array_Version, and so on. Or, you can
store the version numbers for the core classes in a set of global variables
as here, and the rest can stay in the implementation of Class. These are
details that can be worked out.
For now, assume that the version number for the Fixnum class is stored in
the global variable (a static member of a java class called VersionNumbers
or something like that) called Fixnum_Version. If we assume that Fixnum is
not modified, V will be zero. So, ASSERT_VERSION(Fixnum, V, L) will reduce
to:
%r1 = LOAD Fixnum_Version
BNE(%r1, 0, L)
Fixnum_Version will very likely end up in the data cache of the processor
and will always hit there (since there is no store that updates the version
number that flushes it out). Next, the BNE branch will always be false.
So, the branch predictor will quickly latch onto that. Third, since the
fall through is the common case (version number is indeed 0), the next set
of instructions will always be in the instruction cache and wont get
flushed. So, the net effect of the load and the branch is to increase the
instruction count but not necessarily to impact the critical execution path
within the processor. So, these are cheap.
Now, consider this piece of ruby code: (1..100).inject(0) { |s,i| s + i }
This could get compiled to:
ASSERT_VERSION(Range, 0, L)
ASSERT_VERSION(Fixnum, 0, L)
fully inlined version of the loop with regular fixnum additions
L: regular unoptimized version.
I haven't fully thought through this, but it appears like this optimization
strategy will work quite good if we can now work out the problem of how to
optimize placement of ASSERT_VERSION IR instructions. That seems more like
a more manageable dataflow problem at this point. But, I might be speaking
too soon of course.
Subbu.
PS: Note that whenever a class version increments, you will de-opt/recompile
all code that depended on the older class version. But, that won't
guarantee correctness of the currently executing code without some form of
OSR. So, this strategy here will ensure that currently execute code
continues to produce correct results even without OSR.