We are happy to no longer discuss Truffle in this thread if you are looking for 
more short-term solutions and are keeping the use of invokedynamic as an 
invariant. I am confident that Truffle can reach production quality within 12 
months. People interested in Truffle can take a look at the Truffle API in the 
Graal repository and/or engage on the graal-...@openjdk.java.net mailing list.

- thomas

On 29 Aug 2014, at 13:35, Marcus Lagergren <marcus.lagerg...@oracle.com> wrote:

> I think this is an excellent summary, John. 
> 
> I also think that a scary point Charlie and I were trying to make in this 
> thread is that we have <= 12 months or so to slim down existing mechanisms to 
> something that works with startup for the existing indy solutions before 9 is 
> frozen. I think, given the time constraints, that there is no way to do this 
> except by working with the current compiler infrastructure. We already had some 
> nice ideas in Santa Clara this August.  I don’t think 12 months is enough to 
> replace our entire indy architecture with something else entirely, or even to 
> decide what that would be. For the sake of this discussion I am ONLY wearing 
> my “lambda forms scale poorly, we need to fix that for nine” hat, which is a 
> very, very pragmatic hat indeed. I am ONLY wearing this hat. Again: ONLY THE 
> PRAGMATIC HAT. It contains very little architectural vision. 
> 
> I’d totally welcome future dynamic language threads that discuss completely 
> new architectures, but it would be good if we could split those out into more 
> threads, then. I’m just trying to ship a product for Java 9 that works. While I am not 
> PM, 12 months is an incredibly short time to replace the entire world. I 
> don’t think it has ever been done and I don’t think it CAN be done. Even if 
> code exists. Sorry. Pragmatic hat.
> 
> Another point, that I know Charlie has brought up earlier, is that we also 
> cannot abandon those who still target the JVM and seek value in a common 
> platform. At least not before bytecode 2.0, whatever that may entail. 
> Interesting discussion, but again - different thread I think. This means 
> bytecode land. This means indy. 
> 
> So before we soar up into the cloud free architectural sky (in other 
> threads), can we use this one to discuss things like local escape analysis 
> around indys, maybe with cheating (knowing they are indys), local array 
> explosion, indy callsite layout and by appropriating what we can of Fredrik 
> Öhrström’s magic machine code hats? If this is not what you were after, 
> Charlie, do tell me. I believe you were. 
> 
> I want to make it absolutely clear, with no possibility of doubt, that I am 
> merely pragmatic right now, which is all I feel I can find the time to be in 
> the current timeframe. I need a near-term solution. Now. I am wearing blinkers. I 
> am looking straight ahead. The cruel master sits heavy on my saddle with a 
> backpack full of metaspace. I can’t, sadly, afford to live in any other world 
> than this at the moment. Right now, I first want to get rid of lambda form 
> scaling problems. Period. I am trying to ship product. After that we can 
> discuss where our wonderful architectures will take us.
> 
> So if possible - let’s do one thing here: discuss what we can do for 9, in 
> the time frame that is left, to make indys go faster for the use cases that 
> have cropped up. With the current architecture. With lambda forms. With the 
> nearly identical problems that both Charlie and we face. Let a thousand other 
> threads start, discussing the rest, though. That is also great and exactly 
> what this group is for, but I need some help with C1/C2 and invokedynamic at 
> the moment! I need machine code rabbits! I think that machine code rabbits 
> were also what Charlie asked for.
> 
> Regards
> Marcus
> 
> (who is really happy to see MLVM-dev vibrant again, lately)
> 
> On 29 Aug 2014, at 05:48, John Rose <john.r.r...@oracle.com> wrote:
> 
>> On Aug 22, 2014, at 1:08 PM, Charles Oliver Nutter <head...@headius.com> 
>> wrote:
>> 
>>> Marcus coaxed me into making a post about our indy issues. Our indy
>>> issues mostly surround startup and warmup time, so I'm making this a
>>> general post about startup and warmup.
>> 
>> This is a vigorous and interesting discussion.  I will make some piecemeal 
>> replies to
>> specific points, but first, as a HotSpot team member, I'll comment on the 
>> startup
>> problem and then set out our position and vision for indy and dynamic 
>> languages.
>> 
>> Achilles had two heels, and so does Java:  Startup and warmup.  Many teams 
>> of many
>> people have worked on these issues for years, with hard-won progress.  
>> Compared
>> with C programs (say, "/bin/awk"), the JVM takes a long time to get going.
>> 
>> Fundamentally this comes from the decision to package programs as bytecodes
>> (JARs, etc.) rather than machine instructions plus mappable data (dylibs, 
>> etc.).
>> As everyone on this list knows and appreciates, Java bytecodes are a 
>> portable,
>> stable, and compact intermediate representation for both code and data 
>> structure.
>> You can do an amazing variety of great things with them, and there is no 
>> credible
>> proposal (IMO) to replace them wholesale with some other representation.
>> 
>> As an inherent property, bytecodes cannot be executed directly; they require
>> additional machinery: an interpreter, a JIT compiler, and/or an AOT engine.
>> Making this machinery look small is an ongoing challenge for us JVM 
>> implementors.
>> I say "ongoing" (not just "historic") because software complexity scales up 
>> with time.
>> 
>> Our basic move is organizing cold stuff differently from hot stuff.
>> The C2 JIT spends lots of effort on the hot paths in hot methods.
>> On the cold side, Java's class data sharing feature tries to organize large
>> amounts of start-up bytecode in a form that can be quickly loaded and 
>> discarded.
>> The most useful distinctions between hot and cold data and code usually
>> require some sort of online measurement, which is one reason Java is
>> competitive with C++, which generally lacks behavior-dependent optimizations.
>> 
>> Building on these basic considerations, an ideal Java implementation would
>> get through the start-up process quickly by executing cold code 
>> and/or
>> loading pre-configured data structures, with the result that data which must 
>> be
>> configured by every JVM on startup, or by every invocation of a given 
>> application,
>> is quickly assembled.  This would be done in some cases by loading a
>> pre-composed bitwise data image, or perhaps more efficiently by 
>> "decompressing"
>> an operationally encoded description of the initial data image by executing
>> some sort of code.  In fact, that is what Java interpreters are pretty good 
>> at,
>> when augmented by a pre-composed "class data archive" that maps in
>> a pre-parsed version of classes from "rt.jar".  (Sometimes this image is
>> found in a file called "classes.jsa".)
>> 
>> But if more of the initial state of a Java application could be demand-paged 
>> from
>> an image file (like the initial data of C programs) there might be less 
>> latency,
>> since there would certainly be fewer order constraints between the different
>> loading events (at the page and cache line level), allowing prefetch 
>> machinery
>> to hide latency.  This is an important area of growth for Java.
>> 
>> Second, an ideal Java implementation would quickly discover the particular
>> on-line characteristics of the application under execution, and produce
>> tailor-made machine code for that execution, with the result that the
>> application's critical inner loops would execute at full performance,
>> almost immediately.  This would be done in some cases by capturing
>> profile information from early phases of the application and applying them
>> to the creation of customized code.  In fact, the profiling interpreter and
>> tiered C1 JIT are pretty good at gathering such information, and
>> feeding it to the C2 JIT.
>> 
>> What's missing?  Probably a certain amount of prediction of behavior
>> based on previous runs of the application, combined with a way of
>> loading data and executing code that benefits from that prediction,
>> even before the application has gotten that far.  To me that sounds
>> like a static compiler of some sort, though an open-world kind that
>> allows the compiler's model of the world to change during execution.
>> 
>> Maybe another missing bit is a slightly more static initialization model
>> for Java, under which the assembly of initial data can be predicted
>> off-line with greater accuracy.  For starters, we should have array
>> initializers that are more distinct from random bytecode execution,
>> plus a little more discipline about the effects of <clinit> methods,
>> separating those effects into "yes we can statically model this"
>> vs. "do you really want to do this crazy thing?".
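>> 
>> To make the distinction concrete, a minimal sketch (the class and field
>> names are invented for illustration):
>> 
>>     class Tables {
>>         // statically modelable: javac emits <clinit> bytecode that
>>         // fills the array element by element, yet the result is a
>>         // constant the VM could in principle snapshot off-line
>>         static final int[] POWERS = { 1, 2, 4, 8, 16, 32 };
>> 
>>         // not statically modelable: the value depends on the runtime
>>         // environment, so it must be recomputed on every startup
>>         static final String HOME = System.getProperty("user.home");
>>     }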
>> 
>> One wrong answer would be to do more JIT stuff, earlier and oftener.
>> But JIT optimizations are inherently throughput-oriented; they do
>> nothing to reduce start-up or spin-up overheads; they often worsen
>> those overheads.  (Example:  Iteration range splitting which makes
>> the middle of a loop run without array range checks, at the expense
>> of extra calculations at the beginning and end of the loop.)
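>> 
>> In source terms the example looks something like the sketch below; the
>> comments describe what the JIT does, nothing visible in the source:
>> 
>>     static long sum(int[] a, int from, int to) {
>>         long s = 0;
>>         for (int i = from; i < to; i++) {
>>             s += a[i];  // C2 peels pre- and post-loops so this hot
>>         }               // middle runs without per-element range checks
>>         return s;
>>     }
>> 
>> The peeled iterations and extra tests are pure setup cost, which is
>> exactly the wrong currency at startup.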
>> 
>> So what we need is a new set of optimizations for application start-up,
>> akin to those developed for managing static data and relocations in
>> shared libraries.
>> 
>> One thing to note about Java's very dynamic startup semantics is that
>> it's not always bad.  It is not always the right idea to load a statically
>> composed bit-image of data or code, if that image is large compared
>> with its true information content (e.g., a character range table).
>> If the data is going to be traversed sequentially, sometimes it is
>> better to expand it from a less complex representation.
>> Compressed file systems exploit this insight, when they win.
>> 
>> The encoding of Java lambdas uses invokedynamic to make
>> a decisively more compact encoding, relative to inner classes,
>> which can be expanded in local memory faster (in some cases)
>> than it can be loaded from cold bits in a JAR file.  This is one
>> of the reasons Java 8 lambdas (expanded on the fly) beat inner
>> classes (precompiled into a JAR).
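>> 
>> A small sketch of the two encodings (the class is invented, but this is
>> how javac compiles each form):
>> 
>>     import java.util.function.IntSupplier;
>> 
>>     class Encodings {
>>         // one invokedynamic instruction; LambdaMetafactory spins the
>>         // implementation class on the fly at first use
>>         IntSupplier lambda = () -> 42;
>> 
>>         // a separate pre-compiled Encodings$1.class shipped in the JAR
>>         IntSupplier inner = new IntSupplier() {
>>             public int getAsInt() { return 42; }
>>         };
>>     }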
>> 
>> The lesson here is that if you are going to precompile, consider
>> precompiling something compact, and use a load-time translator
>> as needed.  This insight has to be traded off against the complexity
>> of the translation process, and also against the low-level advantage
>> of a pre-fetch-capable load format (such as a brute image of an
>> array of bytes).
>> 
>> That brings us to invokedynamic.  This is designed to be the swiss
>> army knife of bytecode instructions, able to do the job of ten others,
>> yet fitting in one compact carrying case.  When indy works right,
>> a specialized call site is encoded in the natural number of tokens
>> (constant pool entries), expanded quickly by its bootstrap method
>> to a method handle, where the method handle has a simple and
>> natural structure, and is compiled almost as quickly (pending only
>> last touches of profile data) to optimal machine code.
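>> 
>> A minimal sketch of that happy path (the names are illustrative, but
>> the java.lang.invoke API is the real one):
>> 
>>     import java.lang.invoke.*;
>> 
>>     public class Bsm {
>>         // an indy constant pool entry points the JVM at this method
>>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>>                                          String name, MethodType type)
>>                 throws ReflectiveOperationException {
>>             MethodHandle target = lookup.findStatic(Bsm.class, name, type);
>>             return new ConstantCallSite(target);  // simple, natural shape
>>         }
>> 
>>         static int add(int a, int b) { return a + b; }  // eventual target
>>     }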
>> 
>> As a test case, an indy call site is able to emulate any other concrete
>> JVM instruction that can be described as N pops and 0 or 1 pushes,
>> and in that case should quickly compile to the same machine code
>> as if that concrete instruction were given instead.  (Why not throw
>> out the other instructions then?  Well, they can be viewed as
>> shorthand for the equivalent indy instructions, if we wish.  This
>> is suggestive of possible shapes for Bytecode 2.0, but who
>> knows when that will be.)
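>> 
>> A hedged sketch of that test case: an indy bootstrap standing in for a
>> getfield instruction (one pop, one push). Point and the bootstrap class
>> are invented; findGetter is the real API:
>> 
>>     import java.lang.invoke.*;
>> 
>>     class Point { int x; }
>> 
>>     class GetFieldBsm {
>>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>>                                          String name, MethodType type)
>>                 throws ReflectiveOperationException {
>>             // type is (Point)int; once the constant site is installed,
>>             // the JIT should emit the same code as a plain getfield
>>             MethodHandle getter =
>>                 lookup.findGetter(Point.class, name, int.class);
>>             return new ConstantCallSite(getter.asType(type));
>>         }
>>     }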
>> 
>> Conversely, when indy goes wrong, it is probably because there
>> are an unnatural number of bytecode constructs involved (bad
>> factoring) or the bootstrap method is doing something complicated
>> (perhaps a cache is needed somewhere) or the resulting method
>> handle is overly complex (too many "lambda forms" under the
>> covers) or the compiler is stumbling over something and failing
>> to inline and optimize (often shallow inlining or polluted profiles
>> or surprise box/unbox events).
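>> 
>> For the "perhaps a cache is needed somewhere" case, one common remedy
>> looks roughly like this (all names invented; the memoization of the
>> expensive lookup is the point):
>> 
>>     import java.lang.invoke.*;
>>     import java.util.concurrent.ConcurrentHashMap;
>> 
>>     class CachingBsm {
>>         static final ConcurrentHashMap<String, MethodHandle> CACHE =
>>                 new ConcurrentHashMap<>();
>> 
>>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>>                                          String name, MethodType type) {
>>             MethodHandle target =
>>                 CACHE.computeIfAbsent(name + type, key -> {
>>                     try {
>>                         // stand-in for an expensive resolution step
>>                         return lookup.findStatic(CachingBsm.class,
>>                                                  name, type);
>>                     } catch (ReflectiveOperationException e) {
>>                         throw new BootstrapMethodError(e);
>>                     }
>>                 });
>>             return new ConstantCallSite(target);
>>         }
>> 
>>         static int twice(int x) { return x * 2; }  // a sample target
>>     }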
>> 
>> Put another way, when indy goes wrong, it is because the relatively
>> simple programming model it offers is belied by internal simulation
>> overheads.  Language programmers know about simulation overheads
>> and can manage them explicitly, but indy provides an attractive way
>> to refactor them, as long as it does not add too many more overheads.
>> 
>> Is it shocking that a JVM mechanism should have simulation overheads?
>> No, it is not.  What is painful at this moment in time is that those 
>> overheads
>> have not been reduced as quickly as we hoped.  There is a combination
>> of reasons for this, which are not technically interesting.  It is still our
>> position that those overheads are not inherent to the design, but rather
>> implementation artifacts.  For example, for the sake of JDK 8 lambdas,
>> we were able to remove enough overheads to make lambdas worthy
>> successors to inner classes.
>> 
>> There is a range of proposals to overcome more of those overheads
>> by tuning or replacing the implementation.  They all boil down to
>> increased sharing of common structure early in the execution pipeline,
>> with inlining and customization later allowing fully optimized JIT code.
>> Fredrik's "rabbit out of the hat" technique radically simplifies the
>> common structures, by relying on very strong box/unbox and varargs
>> optimizations later in the pipeline.  The current architecture avoids
>> reliance on those optimizations (which are hard to get 100% correct),
>> at the expense of more redundant copies of the IR and the bytecodes.
>> 
>> (The metaphor of "pipeline" is apt, although a code pipeline has a
>> different "fluid" running through each segment.  You can characterize
>> a code pipeline by its phases:  Loader, linker, interpreter, JIT.
>> Method handle IR adds more phases, including the lambda form
>> interpreter and bytecode renderer.)
>> 
>> The current architecture makes multiple copies of each general shape
>> of method handle, to distinguish different combinations of primitives vs.
>> references.  Internally, these shapes are represented in an IR which
>> shows up as "lambda forms".
>> 
>> The current story is that the cost of making a method handle is way
>> too high:  We generate a bit of IR for any non-trivial method handle,
>> and those bits get composed and shoved down the code pipeline.
>> This leads to a large load of code, some of it useful and some not.
>> Also, the IR is retained in an executable form, causing heap bloat at scale.
>> 
>> These lambda forms are obnoxious in the heap and on the execution
>> stack not because they represent small distinctions in code
>> shape, but rather because they are often generated one per method
>> handle.  I intended this to be a temporary state, before 8 FCS, alas.
>> We will be able to improve this state, soon, but it has been a long wait.
>> 
>> One of the reasons it has been difficult to make the improvement is that,
>> paradoxically, the existing over-production of lambda forms has some
>> advantages, akin to the advantages of over-inlining.  If your instruction
>> processing pipeline can handle a 10x overload of redundant instructions,
>> sometimes you win, because profiling and prediction tactics work better
>> on single-use instructions, as opposed to shared-use instructions.
>> This is true both at the bytecode and machine-code levels.
>> 
>> Our major challenge is to preserve existing performance peaks
>> while reducing the volume of IR and bytecodes going down the
>> pipeline.  The challenge is tied to the choice of IR, but not in
>> such a radical way that we need an IR change.  The real
>> challenge, as we see it, is twofold.
>> 
>> First, as method handles are built, recognize emergent similarities
>> of structure, and reuse similar bits of IR to build similar method
>> handles.  Put negatively, don't build full custom IR for each
>> method handle.  Put positively, build generic IR if you can get
>> away with it, or at least cache repeatedly built patterns, as possible.
>> You know you have won when people stop noticing IR dregs
>> in their heaps.
>> 
>> Second, as method handles are executed (and/or as their IR or
>> bytecodes are executed), collect profile data that will be useful
>> to the eventual full-custom object code at any or all of the relevant
>> invocation sites—or nowhere, if those sites don't exist.
>> 
>> These goals are in tension with each other, regardless of what
>> you call your IR.  Exploring this tension has been a key component
>> of our recent engineering.  In basic terms, profile information should
>> be attached as early as possible to the particular method handles
>> bound at an invokedynamic site, while each chunk of IR should be
>> cagily shared among as many method handles as possible.
>> 
>> (The previous requirement is expressed in terms of indy call
>> sites, but it should extend to other kinds of method handle
>> call sites.  An indy call site is, in effect, an invokeExact of a
>> compile-time constant method handle, but we also have
>> optimizations for non-constant method handle invocations,
>> both exact and inexact.  Still, the focus is on indy.)
>> 
>> We can use the "pipeline" metaphor to observe that overly long
>> pipelines have drawbacks.  If code must be dynamically translated
>> into several different states before full throughput, it is possible
>> that some intermediate state will introduce an irregularity that
>> later phases will not be able to normalize or optimize.  This is
>> the case, for example, with the lambda form interpreter.  This
>> exists (a) to work around some bootstrap issues, and (b) to allow
>> programs to warm a little before committing to bytecode.
>> (A third advantage is an executable documentation of semantics.)
>> It appears that neither benefit is strong enough to overcome the
>> disadvantage of having the JIT stumble over the lambda form
>> interpreter, so we are likely to remove that phase.
>> 
>> It does not follow, however, that the other pipeline phases for
>> lambda forms should also be removed.  In fact, neither bytecodes
>> nor bundles of some lower-level IR are a replacement for lambda
>> forms, insofar as they fill a special niche, the need for an easily
>> created representation of a stand-alone, ad hoc JVM method.
>> We need a way to build small bits of behavior that can be
>> assembled, method-wise, into larger bits of behavior.
>> There may be better ways to build such things than lambda
>> forms, but any proposal to eliminate lambda forms needs to
>> provide an up-front design for a way to compose, analyze,
>> cache, and load ad hoc methods.
>> 
>> In the near future, we expect the overhead of a method handle to
>> be more like the overhead of a closure and less like the overhead
>> of an inner class.  This means that you will win if you "compress"
>> your call sites using indy instead of open-coding them with many
>> bytecodes.  The warm-up will process the compact form, leading
>> to less net data motion for loading, while the hot form of the code
>> will be indistinguishable from the open-coded version.
>> 
>> And we expect to get better at capturing the call-site specific types,
>> values, and branch behavior of the contents of an invokedynamic site.
>> 
>> Such work is intended to replicate the hot-path effects of open-coded
>> bytecodes, without their startup time bulk.  It worked for JDK 8 lambdas.
>> We intend it to work for our other customers.
>> 
>> For example, one of our goals is that taking a generic method handle,
>> shrinking it with "asType", and then binding it to an indy site, will
>> end up having a customization effect similar to spinning customized
>> bytecodes for the internals of the generic method handle.  This will
>> get us close to Fredrik's "rabbit" design, even if we don't handle
>> arity polymorphism any time soon.  Note that value types are pushing
>> us in this direction also, since values effectively require the creation
>> of many more customizations, in a system that cannot interchange
>> boxed and unboxed representations.  An early part of this work is
>> carefully defining more optimizer-friendly semantics for primitive
>> boxes, to be extended to value boxes.
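>> 
>> A small sketch of that goal (the names are invented; asType and
>> invokeExact are the real API):
>> 
>>     import java.lang.invoke.*;
>> 
>>     class Shrink {
>>         static Object pick(Object a, Object b) { return a; }
>> 
>>         public static void main(String[] args) throws Throwable {
>>             MethodHandle generic = MethodHandles.lookup().findStatic(
>>                 Shrink.class, "pick",
>>                 MethodType.methodType(Object.class,
>>                                       Object.class, Object.class));
>> 
>>             // "shrink" the generic handle to a primitive-typed view;
>>             // today this inserts box/unbox steps, and the goal is for
>>             // a site bound to it to get fully customized code instead
>>             MethodHandle sized = generic.asType(
>>                 MethodType.methodType(int.class, int.class, int.class));
>> 
>>             int r = (int) sized.invokeExact(7, 8);
>>             System.out.println(r);  // prints 7
>>         }
>>     }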
>> 
>> Moving forward, we expect invokedynamic to continue to provide a
>> powerful way to reduce the complexity of bytecode shapes of advanced
>> languages (including Java), without compromising on performance.
>> 
>> I hope this helps.  Please let me know if something is unclear or seems
>> wrong; these are certainly slippery topics.
>> 
>> Best wishes,
>> — John