On Fri, Jul 31, 2009 at 12:20 AM, Subramanya Sastry<[email protected]> wrote:
> To see this, it is useful to think of all ruby classes carrying with them a
> version number.  [ Technically, this can be done at a more fine-grained
> level (i.e. at the level of methods), but, let us start with classes for
> now. ]  So, all classes start with version number 0.  Whenever you modify a
> class in some fashion (add a method, monkey patch a method, etc.), you
> increment the version number by 1.

This is largely how the current "token" logic works. Whenever a class
or its superclasses changes, its token is flipped. It used to be a
randomly-generated number, but John Rose of Hotspot suggested we just
Object.new the token, since object comparison is fast. It basically
represents our version number for each class though...

> You need to worry about code versions only for those call sites where you
> know enough information about the call site that you can make assumptions
> about the class, method, and method code AND also inline the method code at
> the call site.

...inline or make other potentially-breaking assumptions, like whether
you expect backrefs, whether you need to lift locals to the heap,
etc...but yes, you're right.

> So, all I need to do to ensure correctness is to add a version assumption
> information before the call site S before inlining, as follows:
>
>      ASSERT_VERSION(C, V, L1)   <-- if C is not at version V, branch to L1
> where you have code that dispatches to C.M without inlining the code
>      .... inlined body of method M ....
>      jump L2
> L1: v =o.m(args)
> L2:
>
> At this level, this code will ensure correctness of inlining, even if C is
> Fixnum and M is +.  Note that so far, this does not require any OSR.  We
> will get to how to implement versioning cheaply in a little bit.  But, for
> now note that while this ensures correctness in the general case, this is
> not good enough for Fixnum.  To see this, consider this code:
>      a = 3+5
>      b = a+1
>
> Assume that Fixnum is at version 0.  This will translate to:
>
>    ASSERT_VERSION(Fixnum, 0, L1)
>    a = 8
>    jump L2
> L1: a = call(Fixnum.+, [3,5])
> L2: b = call(lookup_method(a, +), [a, 1])
>
> Just after one inlining, the possibility of optimization stops. because at
> L2, you don't know anymore what the type of 'a' is.  Imagine some moron
> implementing Fixnum.+ to return "hello world" always.
>
> Going back to our generic case, an alternative implementation of the
> inlining optimization at S with an assert version before it is as follows:
>
>      ASSERT_VERSION(C, V, L1)   <-- if C is not at version V, branch to L1
> where you have code that dispatches to C.M without inlining the code
>      .... inlined body of method M ....
>      .... rest of the code for the current method ...
>      return <something>
>
> L1: v = o.m(args)
>      ... rest of the code for the current method...
>      return <something>

Yeah, this picture has been forming in my head as well. If we can
safely and cheaply guard inlined or optimized call logic, then it
opens up a lot of new possibilities for aggressively optimizing
without breaking anything.

You can even take this one step further by having L1 immediately call
a deoptimized version of "the rest of the code", passing all current
method state along. If it's never hit, hotspot won't care. If it's hit
frequently, hotspot will inline it anway. But if it's hit frequently,
we have an easy way to back off and try a different optimization or
more profiling by marking the whole shebang as "not entrant" and
letting it age out.

I think the key to making this work is reducing the version checks as
much as possible and keeping the "everything is ok" code localized at
the top of the method body, with all failure cases immediately bombing
out to a second deoptimized version later in the same body or in
another body entirely. That allows hotspot to easily see the optimized
path and work its magic without excessive branches or
optimized/deoptimized splices getting in the way.

> So, this alternative implementation treats a version failure as an exception
> and branches to an entirely new copy of the rest of the method.  If
> implemented naively, you can run into trouble with lots of exception cases
> for the method.  We'll get to that in a bit.  For now, let us go back to the
> previous example of two back to back additions:
>      a = 3+5
>      b = a+1
>
> So, you would compile this as:
>
>     ASSERT_VERSION(Fixnum, 0, L1)
>     a = 8
>     ASSERT_VERSION(Fixnum, 0, L2)
>     b = 9
>     ...
>
> L1: a = call(Fixnum.+, [3,5])
>      .. and so on ..
>
> L2: b = call(Fixnum.+,[8, 1])
>     ... and so on ..
>
> But, there is one clear optimization here which is that the second
> ASSERT_VERSION(Fixnum, 0, L2) is useless because there is no possibility of
> Fixnum being modified between a = 8 and b = 9.  So, we can get rid of the
> second assert version and its exception code altogether.  If you take this
> line of reasoning further, you will notice that the in a straightline piece
> of code, ASSERT_VERSION instructions will only be present between
> non-inlined calls (to whatever method it might be).  Additionally, you can
> only get rid of them as far as basic block boundaries ..

Yeah, exactly. I think if we have it at the IR level that we're doing
these version checks, a lot of them will simply disappear. And I know
from working on the JVM emitter that it will be trivial for me to look
at the IR and say "oh, we have an ASSERT_VERSION_Instr...I will now
fork my code generation into an optimized path and a deoptimized path.

> Anyway, all these are optimizations of placement of ASSERT_VERSION
> instructions.  This can be explore in a separate thread.  I have not thought
> through it completely.  But, it doesn't seem to be such a big deal.  We have
> reduced the problem of optimizing in the presence of open classes to one of
> optimizing placement of ASSERT_VERSION instructions.
>
> So far, I've not addressed how to implement the ASSERT_VERSION instruction.
> The straightforward way to do this is clearly to have a 'version' field in
> the platform implementation of the Class ruby class.  So, effectively
> ASSERT_VERSION(C, V, L) will reduce to:
>
> v = get_field(C, version)
> bne(v, V, L)
>
> When compiled to machine code, this will get compiled to something like
> this:
>
>   %r1 = LOAD C, field_offset_version
>   BNE(%r1, V, L)
>
> This does seem like an expensive pair of instructions.  A load and a
> branch.  But, with one additional optimization, I'll argue how these
> effectively become NOPs in modern superscalar processors.

Currently for the inline cache it's basically:

cache = get_cache_tuple
token = get_cache_token(cache)
newtoken = get_class_token(o.class)
if token == newtoken
  // fast path
else
  // slow path with recache
end

Much of this could be stuffed behind utility code, as in:

if (token_valid(index_of_cached_token, incoming_object)) {
  // fast path
...

> Since there is no real ruby field called version in Class, there is no
> requirement that version be stored in the class.  You can do all kinds of
> optimizations.  You can store them in a global array, a global hash.  Or
> better yet, you can store them all in a large pool of global variables
> called Fixnum_Version, Float_Version, Array_Version, and so on.  Or, you can
> store the version numbers for the core classes in a set of global variables
> as here, and the rest can stay in the implementation of Class.  These are
> details that can be worked out.

Currently they're an array of method, token tuples stored in each
compilation unit and accessed by hardcoded indices in the bytecode.
That's all abstracted behind the current InlineCachingCallSite.

> For now, assume that the version number for the Fixnum class is stored in
> the global variable (a static member of a java class called VersionNumbers
> or something like that) called Fixnum_Version.  If we assume that Fixnum is
> not modified, V will be zero.  So, ASSERT_VERSION(Fixnum, V, L) will reduce
> to:
>
> %r1 = LOAD Fixnum_Version
> BNE(%r1, 0, L)
>
> Fixnum_Version will very likely end up in the data cache of the processor
> and will always hit there (since there is no store that updates the version
> number that flushes it out).  Next, the BNE branch will always be false.
> So, the branch predictor will quickly latch onto that.  Third, since the
> fall through is the common case (version number is indeed 0), the next set
> of instructions will always be in the instruction cache and wont get
> flushed.  So, the net effect of the load and the branch is to increase the
> instruction count but not necessarily to impact the critical execution path
> within the processor.  So, these are cheap.

Yeah, this is one key thing that's missing at present: we don't have a
fast path to get at *specific* core class version tokens, so we can't
globally check whether, for example, Fixnum has been modified. And we
may want to take it one step further... %r1 = LOAD
Fixnum_+_Version...so if people *add* methods to Fixnum we'll still
know that the bulk of them are still safe and inlinable.

> Now, consider this piece of ruby code:  (1..100).inject(0) { |s,i| s + i }
>
> This could get compiled to:
>
>    ASSERT_VERSION(Range, 0, L)
>    ASSERT_VERSION(Fixnum, 0, L)
>    fully inlined version of the loop with regular fixnum additions
>
> L: regular unoptimized version.
>
> I haven't fully thought through this, but it appears like this optimization
> strategy will work quite good if we can now work out the problem of how to
> optimize placement of ASSERT_VERSION IR instructions.  That seems more like
> a more manageable dataflow problem at this point.  But, I might be speaking
> too soon of course.

No no, I think you're exactly on the right path. The strategy is
sound. From experience, I know the details will be tricky, especially
making sure that all these guards don't obscure the optimized path
from Hotspot, but when I've attempted this in the past it was only to
do a static dispatch, not to inline loops and math. The inlined cases
could show *tremendous* improvement in performance, and I would wager
the version checks get picked up by Hotspot and optimized extremely
well.

It may be possible to prototype some limited cases with the current
compiler and see what the assembly output looks like.

> PS: Note that whenever a class version increments, you will de-opt/recompile
> all code that depended on the older class version.  But, that won't
> guarantee correctness of the currently executing code without some form of
> OSR.   So, this strategy here will ensure that currently execute code
> continues to produce correct results even without OSR.

Yeah, I think the above strategy is as close as we can get. But it may
be as close as we *need* to get if hotspot does a good job optimizing
those guards. And ultimately, it may work extremely well if we can
coarsen the guard checks and eliminate boxed math or closure overhead.

Very nice :)

- Charlie

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

    http://xircles.codehaus.org/manage_email


Reply via email to