We are happy to no longer discuss Truffle in this thread if you are looking for more short-term solutions and keeping the usage of invokedynamic as an invariant. I am confident that Truffle can reach production quality within 12 months. People interested in Truffle can take a look at the Truffle API in the Graal repository and/or engage in the graal-...@openjdk.java.net mailing list.
- thomas

On 29 Aug 2014, at 13:35, Marcus Lagergren <marcus.lagerg...@oracle.com> wrote:

> I think this is an excellent summary, John.
>
> I also think that a scary point Charlie and I were trying to make in this thread is that we have <= 12 months or so to slim down existing mechanisms to something that works with startup for the existing indy solutions before 9 is frozen. I think, given the time constraints, that there is no way to do this WITHOUT working with the current compiler infrastructure. We already had some nice ideas in Santa Clara this August. I don’t think 12 months is enough to replace our entire indy architecture with something else entirely, or even to decide what that would be. For the sake of this discussion I am ONLY wearing my “lambda forms scale poorly, we need to fix that for nine” hat, which is a very, very pragmatic hat indeed. I am ONLY wearing this hat. Again: ONLY THE PRAGMATIC HAT. It contains very little architectural vision.
>
> I’d totally welcome future dynamic language threads that discuss completely new architectures, but it would be good if we could split this thread into more threads, then. I’m just trying to ship a product for Java 9 that works. While I am not a PM, 12 months is an incredibly short time to replace the entire world. I don’t think it has ever been done and I don’t think it CAN be done. Even if code exists. Sorry. Pragmatic hat.
>
> Another point, which I know Charlie has brought up earlier, is that we also cannot abandon those who still target the JVM and seek value in a common platform. At least not before bytecode 2.0, whatever that may entail. Interesting discussion, but again - different thread, I think. This means bytecode land. This means indy.
>
> So before we soar up into the cloud-free architectural sky (in other threads), can we use this one to discuss things like local escape analysis around indys, maybe with cheating (knowing they are indys), local array explosion, indy callsite layout, and appropriating what we can of Fredrik Öhrström’s magic machine code hats? If this is not what you were after, Charlie, do tell me. I believe you were.
>
> I want to make it absolutely clear, with no possibility of doubt, that I am merely pragmatic right now, which is all I feel I can find the time to be in the current timeframe. I need a near solution. Now. I am wearing blinkers. I am looking straight ahead. The cruel master sits heavy on my saddle with a backpack full of metaspace. I can’t, sadly, afford to live in any other world than this at the moment. Right now, I first want to get rid of lambda form scaling problems. Period. I am trying to ship product. After that we can discuss where our wonderful architectures will take us.
>
> So if possible - let’s do one thing here: discuss what we can do for 9, in the time frame that is left, to make indys go faster for the use cases that have cropped up. With the current architecture. With lambda forms. With the nearly identical problems that both Charlie and we face. Let a thousand other threads start, discussing the rest, though. That is also great and exactly what this group is for, but I need some help with C1/C2 and invokedynamic at the moment! I need machine code rabbits! I think that machine code rabbits were also what Charlie asked for.
>
> Regards
> Marcus
>
> (who is really happy to see MLVM-dev vibrant again, lately)
>
> On 29 Aug 2014, at 05:48, John Rose <john.r.r...@oracle.com> wrote:
>
>> On Aug 22, 2014, at 1:08 PM, Charles Oliver Nutter <head...@headius.com> wrote:
>>
>>> Marcus coaxed me into making a post about our indy issues. Our indy issues mostly surround startup and warmup time, so I'm making this a general post about startup and warmup.
>>
>> This is a vigorous and interesting discussion. I will make some piecemeal replies to specific points, but first, as a HotSpot team member, I'll comment on the startup problem and then set out our position and vision for indy and dynamic languages.
>>
>> Achilles had two heels, and so does Java: startup and warmup. Many teams of many people have worked on these issues for years, with hard-won progress. Compared with C programs (say, "/bin/awk"), the JVM takes a long time to get going.
>>
>> Fundamentally this comes from the decision to package programs as bytecodes (JARs, etc.) rather than machine instructions plus mappable data (dylibs, etc.). As everyone on this list knows and appreciates, Java bytecodes are a portable, stable, and compact intermediate representation for both code and data structure. You can do an amazing variety of great things with them, and there is no credible proposal (IMO) to replace them wholesale with some other representation.
>>
>> As an inherent property, bytecodes cannot be executed directly; they require additional machinery: an interpreter, JIT compiler, and/or AOT engine. Making this machinery look small is an ongoing challenge for us JVM implementors. I say "ongoing" (not just "historic") because software complexity scales up with time.
>>
>> Our basic move is organizing cold stuff differently from hot stuff. The C2 JIT spends lots of effort on the hot paths in hot methods. On the cold side, Java's class data sharing feature tries to organize large amounts of start-up bytecode in a form that can be quickly loaded and discarded. The most useful distinctions between hot and cold data and code usually require some sort of online measurement, which is one reason Java is competitive with C++, which generally lacks behavior-dependent optimizations.
>>
>> Building on these basic considerations, an ideal Java implementation would get through the start-up process quickly by executing cold code and/or loading pre-configured data structures, with the result that data which must be configured by every JVM on startup, or by every invocation of a given application, is quickly assembled. This would be done in some cases by loading a pre-composed bitwise data image, or perhaps more efficiently by "decompressing" an operationally encoded description of the initial data image by executing some sort of code. In fact, that is what Java interpreters are pretty good at, when augmented by a pre-composed "class data archive" that maps in a pre-parsed version of classes from "rt.jar". (Sometimes this image is found in a file called "classes.jsa".)
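(A toy sketch of the "decompress at startup" idea, with hypothetical code that is not from the JDK: a class initializer expands a compact run-length description into a table, instead of loading a full bitwise image of it.)

    // Hypothetical: build a lookup table in <clinit> by expanding a compact
    // (start, length) encoding, rather than loading the full table bitwise.
    final class DigitTable {
        // Two "digit" ranges: ASCII '0'..'9' and Arabic-Indic U+0660..U+0669.
        private static final int[] RANGES = { '0', 10, 0x0660, 10 };
        static final boolean[] IS_DIGIT = new boolean[0x700];
        static {
            for (int i = 0; i < RANGES.length; i += 2) {
                for (int j = 0; j < RANGES[i + 1]; j++) {
                    IS_DIGIT[RANGES[i] + j] = true;
                }
            }
        }
        private DigitTable() {}
    }

(On the pre-composed-image side, classic class data sharing is driven by flags such as -Xshare:dump, which writes the archive, and -Xshare:on, which maps it in.)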
>> But if more of the initial state of a Java application could be demand-paged from an image file (like the initial data of C programs) there might be less latency, since there would certainly be fewer order constraints between the different loading events (at the page and cache line level), allowing prefetch machinery to hide latency. This is an important area of growth for Java.
>>
>> Second, an ideal Java implementation would quickly discover the particular on-line characteristics of the application under execution, and produce tailor-made machine code for that execution, with the result that the application's critical inner loops would execute at full performance, almost immediately. This would be done in some cases by capturing profile information from early phases of the application and applying it to the creation of customized code. In fact, the profiling interpreter and tiered C1 JIT are pretty good at gathering such information, and feeding it to the C2 JIT.
>>
>> What's missing? Probably a certain amount of prediction of behavior based on previous runs of the application, combined with a way of loading data and executing code that benefits from that prediction, even before the application has gotten that far. To me that sounds like a static compiler of some sort, though an open-world kind that allows the compiler's model of the world to change during execution.
>>
>> Maybe another missing bit is a slightly more static initialization model for Java, under which the assembly of initial data can be predicted off-line with greater accuracy. For starters, we should have array initializers that are more distinct from random bytecode execution, plus a little more discipline about the effects of <clinit> methods, separating those effects into "yes, we can statically model this" vs. "do you really want to do this crazy thing?".
>>
>> One wrong answer would be to do more JIT stuff, earlier and oftener. But JIT optimizations are inherently throughput-oriented; they do nothing to reduce start-up or spin-up overheads, and they often worsen those overheads. (Example: iteration range splitting, which makes the middle of a loop run without array range checks, at the expense of extra calculations at the beginning and end of the loop.)
>>
>> So what we need is a new set of optimizations for application start-up, akin to those developed for managing static data and relocations in shared libraries.
>>
>> One thing to note about Java's very dynamic startup semantics is that they are not always bad. It is not always the right idea to load a statically composed bit-image of data or code, if that image is large compared with its true information content (e.g., a character range table). If the data is going to be traversed sequentially, sometimes it is better to expand it from a less complex representation. Compressed file systems exploit this insight, when they win.
>>
>> The encoding of Java lambdas uses invokedynamic to make a decisively more compact encoding, relative to inner classes, which can be expanded in local memory faster (in some cases) than it can be loaded from cold bits in a JAR file. This is one of the reasons Java 8 lambdas (expanded on the fly) beat inner classes (precompiled into a JAR).
>>
>> The lesson here is that if you are going to precompile, consider precompiling something compact, and use a load-time translator as needed. This insight has to be traded off against the complexity of the translation process, and also against the low-level advantage of a pre-fetch-capable load format (such as a brute image of an array of bytes).
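A bootstrap method is exactly such a load-time translator: it expands a compact constant-pool description, once, at link time, into a live method handle. A minimal sketch (the names are illustrative; the invokedynamic instruction that references this bootstrap would be emitted by a bytecode generator such as ASM, since javac does not expose arbitrary indy):

    import java.lang.invoke.CallSite;
    import java.lang.invoke.ConstantCallSite;
    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class IndyBootstrap {
        // Called once per call site, on first execution of the instruction:
        // translate the symbolic description (name + type) into a live target.
        public static CallSite bootstrap(MethodHandles.Lookup lookup,
                                         String name,
                                         MethodType type) throws Exception {
            MethodHandle target = lookup.findStatic(IndyBootstrap.class, name, type);
            // A ConstantCallSite promises the target never changes, so the
            // JIT may inline through the site as if it were an invokestatic.
            return new ConstantCallSite(target);
        }

        static String greet(String who) { return "hello, " + who; }
    }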
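(To make the emulation claim concrete, here is a hedged sketch, not JDK code: the handle an iadd-style indy site would bind is just an (int,int)int addition, reachable through the same java.lang.invoke machinery.)

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class IaddSite {
        public static void main(String[] args) throws Throwable {
            // The target an "iadd" indy site would bind: 2 int pops, 1 int push.
            MethodHandle iadd = MethodHandles.lookup().findStatic(
                    Integer.class, "sum",
                    MethodType.methodType(int.class, int.class, int.class));
            // invokeExact at the precise (int,int)int type; with a constant
            // target, this should collapse to the same machine code as iadd.
            int r = (int) iadd.invokeExact(2, 3);
            System.out.println(r); // 5
        }
    }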
>> Conversely, when indy goes wrong, it is probably because there is an unnatural number of bytecode constructs involved (bad factoring), or the bootstrap method is doing something complicated (perhaps a cache is needed somewhere), or the resulting method handle is overly complex (too many "lambda forms" under the covers), or the compiler is stumbling over something and failing to inline and optimize (often shallow inlining, polluted profiles, or surprise box/unbox events).
>>
>> Put another way, when indy goes wrong, it is because the relatively simple programming model it offers is belied by internal simulation overheads. Language programmers know about simulation overheads and can manage them explicitly, but indy provides an attractive way to refactor them, as long as it does not add too many more overheads of its own.
>>
>> Is it shocking that a JVM mechanism should have simulation overheads? No, it is not. What is painful at this moment in time is that those overheads have not been reduced as quickly as we hoped. There is a combination of reasons for this, which are not technically interesting. It is still our position that those overheads are not inherent to the design, but rather implementation artifacts. For example, for the sake of JDK 8 lambdas, we were able to remove enough overheads to make lambdas worthy successors to inner classes.
>>
>> There is a range of proposals to overcome more of those overheads by tuning or replacing the implementation. They all boil down to increased sharing of common structure early in the execution pipeline, with inlining and customization later allowing fully optimized JIT code. Fredrik's "rabbit out of the hat" technique radically simplifies the common structures, by relying on very strong box/unbox and varargs optimizations later in the pipeline. The current architecture avoids reliance on those optimizations (which are hard to get 100% correct), at the expense of more redundant copies of the IR and the bytecodes.
>>
>> (The metaphor of a "pipeline" is apt, although a code pipeline has a different "fluid" running through each segment. You can characterize a code pipeline by its phases: loader, linker, interpreter, JIT. Method handle IR adds more phases, including the lambda form interpreter and the bytecode renderer.)
>>
>> The current architecture makes multiple copies of each general shape of method handle, to distinguish different combinations of primitives vs. references. Internally, these shapes are represented in an IR which shows up as "lambda forms".
>>
>> The current story is that the cost of making a method handle is way too high: we generate a bit of IR for any non-trivial method handle, and those bits get composed and shoved down the code pipeline. This leads to a large load of code, some of it useful and some not. Also, the IR is retained in an executable form, causing heap bloat at scale.
>>
>> These lambda forms are obnoxious in the heap and on the execution stack not because they represent small distinctions between code shapes, but rather because they are often generated one per method handle. I intended this to be a temporary state, before 8 FCS, alas. We will be able to improve this state, soon, but it has been a long wait.
>>
>> One of the reasons it has been difficult to make the improvement is that, paradoxically, the existing over-production of lambda forms has some advantages, akin to the advantages of over-inlining. If your instruction processing pipeline can handle a 10x overload of redundant instructions, sometimes you win, because profiling and prediction tactics work better on single-use instructions than on shared-use instructions. This is true at both the bytecode and machine-code levels.
>>
>> Our major challenge is to preserve existing performance peaks while reducing the volume of IR and bytecodes going down the pipeline. The challenge is tied to the choice of IR, but not in such a radical way that we need an IR change. The real challenge, as we see it, is twofold.
>>
>> First, as method handles are built, recognize emergent similarities of structure, and reuse similar bits of IR to build similar method handles. Put negatively, don't build fully custom IR for each method handle. Put positively, build generic IR if you can get away with it, or at least cache repeatedly built patterns, where possible. You know you have won when people stop noticing IR dregs in their heaps.
>>
>> Second, as method handles are executed (and/or as their IR or bytecodes are executed), collect profile data that will be useful to the eventual full-custom object code at any or all of the relevant invocation sites—or nowhere, if those sites don't exist.
>>
>> These goals are in tension with each other, regardless of what you call your IR. Exploring this tension has been a key component of our recent engineering. In basic terms, profile information should be attached as early as possible to the particular method handles bound at an invokedynamic site, while each chunk of IR should be cagily shared among as many method handles as possible.
>>
>> (The previous requirement is expressed in terms of indy call sites, but it should extend to other kinds of method handle call sites. An indy call site is, in effect, an invokeExact of a compile-time constant method handle, but we also have optimizations for non-constant method handle invocations, both exact and inexact. Still, the focus is on indy.)
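(That equivalence can be approximated in plain Java; a sketch relying on the well-known treatment of a static final MethodHandle as a constant by the JIT:)

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class ConstantSite {
        // A static final method handle behaves much like the constant handle
        // behind an indy site: the JIT can fold it and inline its target.
        static final MethodHandle HYPOT;
        static {
            try {
                HYPOT = MethodHandles.lookup().findStatic(
                        Math.class, "hypot",
                        MethodType.methodType(double.class, double.class, double.class));
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        static double call(double x, double y) throws Throwable {
            return (double) HYPOT.invokeExact(x, y); // inlinable, like an indy site
        }
    }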
>> We can use the "pipeline" metaphor to observe that overly long pipelines have drawbacks. If code must be dynamically translated into several different states before full throughput, it is possible that some intermediate state will introduce an irregularity that later phases will not be able to normalize or optimize. This is the case, for example, with the lambda form interpreter. This exists (a) to work around some bootstrap issues, and (b) to allow programs to warm up a little before committing to bytecode. (A third advantage is that it serves as executable documentation of the semantics.) It appears that neither benefit is strong enough to overcome the disadvantage of having the JIT stumble over the lambda form interpreter, so we are likely to remove that phase.
>>
>> It does not follow, however, that the other pipeline phases for lambda forms should also be removed. In fact, neither bytecodes nor bundles of some lower-level IR are a replacement for lambda forms, insofar as lambda forms fill a special niche: the need for an easily created representation of a stand-alone, ad hoc JVM method. We need a way to build small bits of behavior that can be assembled, method-wise, into larger bits of behavior. There may be better ways to build such things than lambda forms, but any proposal to eliminate lambda forms needs to provide an up-front design for a way to compose, analyze, cache, and load ad hoc methods.
>>
>> In the near future, we expect the overhead of a method handle to be more like the overhead of a closure and less like the overhead of an inner class. This means that you will win if you "compress" your call sites using indy instead of open-coding them with many bytecodes. The warm-up will process the compact form, leading to less net data motion for loading, while the hot form of the code will be indistinguishable from the open-coded version.
>>
>> And we expect to get better at capturing the call-site-specific types, values, and branch behavior of the contents of an invokedynamic site.
>>
>> Such work is intended to replicate the hot-path effects of open-coded bytecodes, without their startup-time bulk. It worked for JDK 8 lambdas. We intend it to work for our other customers.
>>
>> For example, one of our goals is that taking a generic method handle, shrinking it with "asType", and then binding it to an indy site, will end up having a customization effect similar to spinning customized bytecodes for the internals of the generic method handle. This will get us close to Fredrik's "rabbit" design, even if we don't handle arity polymorphism any time soon. Note that value types are pushing us in this direction also, since values effectively require the creation of many more customizations, in a system that cannot interchange boxed and unboxed representations. An early part of this work is carefully defining more optimizer-friendly semantics for primitive boxes, to be extended to value boxes.
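(A sketch of the "shrink with asType" shape just described, with hypothetical names; the customization itself happens inside the JVM, not in user code:)

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    public class Shrink {
        static Object pick(Object a, Object b) { return a; } // a "generic" method

        public static void main(String[] args) throws Throwable {
            MethodHandle generic = MethodHandles.lookup().findStatic(
                    Shrink.class, "pick",
                    MethodType.methodType(Object.class, Object.class, Object.class));
            // Shrink the erased (Object,Object)Object handle to a sharp type.
            // The stated goal: once bound at an indy site, such a handle is
            // customized as if bytecodes had been spun for the generic
            // internals at the sharp type.
            MethodHandle sharp = generic.asType(
                    MethodType.methodType(String.class, String.class, String.class));
            System.out.println((String) sharp.invokeExact("left", "right")); // left
        }
    }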
>> Moving forward, we expect invokedynamic to continue to provide a powerful way to reduce the complexity of the bytecode shapes of advanced languages (including Java), without compromising on performance.
>>
>> I hope this helps. Please let me know if something is unclear or seems wrong; these are certainly slippery topics.
>>
>> Best wishes,
>> — John
_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev