Re: The Great Startup Problem

Vladimir Ivanov Mon, 01 Sep 2014 03:42:13 -0700

I'd like to focus on reducing the number of LambdaForm instances.

Doing so benefits both dynamic memory footprint (fewer LambdaForms => less heap/metaspace used) and warmup (fewer LambdaForms => less LF instantiation/interpretation/bytecode translation).

After JVMLS we had a discussion on that topic and worked out short-term and mid-term plans to address the issue.

With JEP 210 [1] (targeted for 8u40) we take a first step and start sharing different types of LambdaForms at the basic-type level. It considerably reduces the number of LambdaForms (up to 5x on Octane with Nashorn), but their number is still quite large and potentially unbounded. So, post-JEP 210 work is needed in the JDK 9 time frame.


There are 2 directions to address the problem:
  * improve sharing of LambdaForms
  * introduce specialized LambdaForm types

I'll discuss LF sharing in detail below, but to illustrate the specialization aspect, consider the SwitchPoint case. The current encoding is quite heavyweight (GWT + dynamic invoker on MCS). If SwitchPoint is used extensively, a more compact encoding can give significant savings.
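
For readers less familiar with the API, here is a minimal sketch of how a SwitchPoint-guarded handle is built with the public java.lang.invoke API. The class and method names (SwitchPointSketch, fast, slow, guarded, call) are made up for illustration; the sketch shows only the user-visible behavior, not the internal LambdaForm encoding discussed above.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

class SwitchPointSketch {
    // Hypothetical target and fallback, for illustration only.
    static String fast(String s) { return "fast:" + s; }
    static String slow(String s) { return "slow:" + s; }

    // Builds a handle that calls fast() until the SwitchPoint is
    // invalidated, and slow() afterwards.
    static MethodHandle guarded(SwitchPoint sp) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType type = MethodType.methodType(String.class, String.class);
            MethodHandle target = lookup.findStatic(SwitchPointSketch.class, "fast", type);
            MethodHandle fallback = lookup.findStatic(SwitchPointSketch.class, "slow", type);
            return sp.guardWithTest(target, fallback);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Invokes the handle and wraps the checked Throwable for convenience.
    static String call(MethodHandle mh, String arg) {
        try {
            return (String) mh.invokeExact(arg);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        SwitchPoint sp = new SwitchPoint();
        MethodHandle mh = guarded(sp);
        System.out.println(call(mh, "x"));   // takes the target path
        SwitchPoint.invalidateAll(new SwitchPoint[] { sp });
        System.out.println(call(mh, "x"));   // takes the fallback path
    }
}
```

Every such guarded handle currently carries its own GWT-shaped form, which is why heavy SwitchPoint users would benefit from a more compact encoding.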

Regarding LF sharing, the ultimate goal for further reduction is to have a single implementation per combinator type.

To make that happen (and keep peak performance at the same level) we need the following from the JIT/VM:


(a) per method handle instance profiling (branch frequencies + type profile)
    to be able to share LambdaForms

(b) improved box/unbox elimination
    to be able to coerce all parameter types to Object

(c) array explosion
    to be able to place and pass parameters in an array

I'll use GuardWithTest (GWT) combinator to illustrate different cases.

(1) Originally, LF for GWT is implemented in the following way:

guard=Lambda(a0:L,a1:L,a2:L,a3:I,a4:D)=>{
  t5:I=MH1(a1:L,a2:L,a3:I,a4:D);
  t6:L=MethodHandleImpl.selectAlternative(t5:I, MH2, MH3);
  t7:L=MethodHandle.invokeBasic(t6:L,a1:L,a2:L,a3:I,a4:D);t7:L}

MH1, MH2 & MH3 are the test, target, and fallback MethodHandles embedded into the LambdaForm. This means each LambdaForm is specialized for particular MethodHandles and can't be shared.
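
As a reminder of what the combinator looks like from the user side, here is a small sketch using MethodHandles.guardWithTest with the same (LLID) parameter shape as in the form above. The names (GwtSketch, test, target, fallback, buildGuard, call) are invented for the example and the bodies are trivial on purpose.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class GwtSketch {
    // Hypothetical test/target/fallback taking (Object, Object, int, double).
    static boolean test(Object a, Object b, int i, double d) { return i > 0; }
    static Object target(Object a, Object b, int i, double d) { return "target"; }
    static Object fallback(Object a, Object b, int i, double d) { return "fallback"; }

    // Combines the three handles; this plays the role of MH1, MH2, MH3.
    static MethodHandle buildGuard() {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            MethodType type = MethodType.methodType(
                    Object.class, Object.class, Object.class, int.class, double.class);
            MethodHandle test = l.findStatic(GwtSketch.class, "test",
                    type.changeReturnType(boolean.class));
            MethodHandle target = l.findStatic(GwtSketch.class, "target", type);
            MethodHandle fallback = l.findStatic(GwtSketch.class, "fallback", type);
            return MethodHandles.guardWithTest(test, target, fallback);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Calls the guard with fixed dummy arguments and the given int.
    static Object call(MethodHandle guard, int i) {
        try {
            return guard.invoke("a", "b", i, 1.0);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle guard = buildGuard();
        System.out.println(call(guard, 1));    // test passes, target runs
        System.out.println(call(guard, -1));   // test fails, fallback runs
    }
}
```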

(2) To allow sharing, the following shape is used (in the JEP 210 [1] implementation):

guard=Lambda(a0:L,a1:L,a2:L,a3:I,a4:D)=>{
  t5:L=BoundMethodHandle$Species_L3.argL0(a0:L);
  t6:L=BoundMethodHandle$Species_L3.argL1(a0:L);
  t7:L=BoundMethodHandle$Species_L3.argL2(a0:L);
  t8:I=MethodHandle.invokeBasic(t5:L,a1:L,a2:L,a3:I,a4:D);
  t9:L=MethodHandleImpl.selectAlternative(t8:I, t6, t7);
  t10:L=MethodHandle.invokeBasic(t9:L,a1:L,a2:L,a3:I,a4:D);t10:L}

Test, target, and fallback MHs are stored in the corresponding BMH. This allows the LF to be shared among GWTs with the same erased signature (in the example: (LLID)L).
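
To make "same erased signature" concrete, here is a rough sketch of how a MethodType maps to a basic-type string such as (LLID)L. The helper names (BasicTypeSketch, basicType, basicSignature) are hypothetical and this is not the JDK's internal code; it assumes the usual basic types L, I, J, F, D and V, with all sub-int primitives folded into I.

```java
import java.lang.invoke.MethodType;

class BasicTypeSketch {
    // Maps a Java type to its basic-type letter (illustrative helper).
    static char basicType(Class<?> c) {
        if (!c.isPrimitive()) return 'L';   // every reference type
        if (c == long.class) return 'J';
        if (c == float.class) return 'F';
        if (c == double.class) return 'D';
        if (c == void.class) return 'V';
        return 'I';                         // boolean, byte, short, char, int
    }

    // Produces the erased signature that would serve as the sharing key.
    static String basicSignature(MethodType type) {
        StringBuilder sb = new StringBuilder("(");
        for (Class<?> p : type.parameterList()) {
            sb.append(basicType(p));
        }
        return sb.append(')').append(basicType(type.returnType())).toString();
    }

    public static void main(String[] args) {
        MethodType a = MethodType.methodType(
                Object.class, String.class, Object.class, int.class, double.class);
        MethodType b = MethodType.methodType(
                String.class, Integer.class, Thread.class, boolean.class, double.class);
        // Two different method types that erase to the same signature.
        System.out.println(basicSignature(a));
        System.out.println(basicSignature(b));
    }
}
```

Two GWTs whose types erase to the same string can use one LambdaForm, which is where the reduction comes from.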

At this point, sharing introduces the first problem: profile pollution (of both branch frequencies and type profiles). So, in order to avoid peak performance regressions, we need to gather MH-specific profiling info and feed it into the JIT.

I have been playing with a solution for (a), and user-level VM hooks to feed a profile (gathered by the application) into the JIT look promising (an early prototype is here [4]).



(3) The next step is to specialize by arity: (LLID)L => (LLLL)L
guard=Lambda(a0:L,a1:L,a2:L,a3:L,a4:L)=>{
  t5:L=BoundMethodHandle$Species_L3.argL0(a0:L);
  t6:L=BoundMethodHandle$Species_L3.argL1(a0:L);
  t7:L=BoundMethodHandle$Species_L3.argL2(a0:L);
  t8:I=MethodHandle.invokeBasic(t5:L,a1:L,a2:L,a3:L,a4:L);
  t9:L=MethodHandleImpl.selectAlternative(t8:I, t6, t7);
  t10:L=MethodHandle.invokeBasic(t9:L,a1:L,a2:L,a3:L,a4:L);t10:L}

This version can be used for any GWT with arity 4. If necessary, parameters are boxed further up the MH chain and unboxed further down it. To avoid performance degradation, all parameter boxing/unboxing operations should be reliably eliminated during JIT compilation.

The current state of boxing/unboxing elimination in C2 doesn't cover all the cases. More work to improve the optimization, or additional hints at the bytecode level, is needed.
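
The boxing that step (3) relies on can be seen at the API level with asType and MethodType.generic(). The sketch below is only an illustration with made-up names (GenericSketch, describe, genericView, call); it shows the user-visible adaptation, and it is exactly the box/unbox pairs introduced here that the JIT would have to eliminate.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class GenericSketch {
    // Hypothetical target with the (String, String, int, double) shape.
    static String describe(String a, String b, int i, double d) {
        return a + b + i + d;
    }

    // Adapts the typed handle to (Object,Object,Object,Object)Object.
    // The adapter casts the references and unboxes the int and double.
    static MethodHandle genericView() {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                    GenericSketch.class, "describe",
                    MethodType.methodType(String.class,
                            String.class, String.class, int.class, double.class));
            return mh.asType(mh.type().generic());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The caller side: every argument arrives as an Object (boxed if needed).
    static Object call(MethodHandle mh, Object a, Object b, Object c, Object d) {
        try {
            return (Object) mh.invokeExact(a, b, c, d);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle mh = genericView();
        System.out.println(mh.type());
        System.out.println(call(mh, "a", "b", 3, 1.5));
    }
}
```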



(4) The last step is to switch to varargs version:
(LLID)L => (LLLL)L => (L)L

guard=Lambda(a0:L,a1:L)=>{
  t2:L=BoundMethodHandle$Species_L3.argL0(a0:L);
  t3:L=BoundMethodHandle$Species_L3.argL1(a0:L);
  t4:L=BoundMethodHandle$Species_L3.argL2(a0:L);
  t5:I=MethodHandle.invokeBasic(t2:L,a1:L);
  t6:L=MethodHandleImpl.selectAlternative(t5:I, t3, t4);
  t7:L=MethodHandle.invokeBasic(t6:L,a1:L);t7:L}

a1:L is an array containing all parameters (boxed, if necessary). Parameters are boxed and placed into an array further up the MH chain, then extracted and unboxed before being passed into the target method. To keep performance at the same level and be able to eliminate all boxing/unboxing pairs, the JIT should be able to reliably "see" through the parameter array. "Frozen" arrays [2] (immutable + identity-less) are good candidates to use here.
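
At the API level the array-passing shape corresponds roughly to asSpreader on a generic handle. The following sketch uses invented names (SpreadSketch, describe, arrayView, call) and an ordinary mutable Object[]; frozen arrays do not exist in the JDK, so this shows only the calling convention, not the proposed optimization.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class SpreadSketch {
    // Hypothetical target with the (String, String, int, double) shape.
    static String describe(String a, String b, int i, double d) {
        return a + b + i + d;
    }

    // (String,String,int,double)String => (Object x4)Object => (Object[])Object
    static MethodHandle arrayView() {
        try {
            MethodHandle mh = MethodHandles.lookup().findStatic(
                    SpreadSketch.class, "describe",
                    MethodType.methodType(String.class,
                            String.class, String.class, int.class, double.class));
            MethodHandle generic = mh.asType(mh.type().generic());
            // Extract four elements from the array and pass them on.
            return generic.asSpreader(Object[].class, 4);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The array is passed as a single argument, not expanded as varargs.
    static Object call(MethodHandle mh, Object[] params) {
        try {
            return (Object) mh.invokeExact(params);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        MethodHandle mh = arrayView();
        System.out.println(mh.type());
        System.out.println(call(mh, new Object[] { "a", "b", 3, 1.5 }));
    }
}
```

With this shape a single form covers every arity, at the price of an array allocation plus boxing per call unless the JIT can remove them.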



To summarize:

* (2) is what is targeted for 8u40; it requires a solution for the profile pollution, which I'm working on right now [4];
* (3) & (4) require additional support from the VM (enhanced box/unbox elimination, frozen arrays) and are _not_ targeted for 8u40. JDK-8054381 [3] is filed to track this work.


Best regards,
Vladimir Ivanov

[1] https://bugs.openjdk.java.net/browse/JDK-8046703
[2] http://cr.openjdk.java.net/~jrose/pres/201407-JVMEvolution.pdf, slide 40
[3] https://bugs.openjdk.java.net/browse/JDK-8054381
[4] http://cr.openjdk.java.net/~vlivanov/profiling/

On 8/25/14, 1:32 PM, Marcus Lagergren wrote:

Regarding indy dense code:

It is certainly a problem both for JRuby with indy and Nashorn with indy that 
indy scalability is so bad in 9 builds with the current JITs. I suspect that as 
Java 8 grows as a code base and as a language, it will turn into a problem with 
Java 8 lambdas too. Nashorn generates a lot more code to pick the correct type 
and generate faster code, this means a lot more indys. This means a lot more 
lambdaforms. This means a lot more metaspace. This means a lot longer warmup. 
And lambdaform code that never really has a chance to be properly optimized - 
sometimes just simply because the JIT stops inlining, or sometimes because 
java.lang.invoke is full of boxing and arraycopies that simply don’t go away.

As Charlie pointed out, an invokedynamic callsite is generated as a separate 
method in a separate class (albeit anonymous), which eventually loads up the 
metaspace with tremendous amounts of stuff. Sergey Kuksenko had a very 
interesting performance analysis presentation at JVMLS this year where ~41% of 
his runtime for Nashorn with octane.box2d was non-inlined lambda forms. And this is 
basically just the mechanisms pushing parameters and applying filters around 
the callsite. Seems like lambdaforms have to be treated specially (or rather 
indy callsites) by the JIT.

One solution that was proposed for 8u40 was JEP210 (lambda form caching), which 
does indeed keep footprint down, but performance suffers mightily since the 
same lambdaform snippet can now be used at two completely different call sites, 
which brings us cache pollution. Vladimir is on vacation, but even though this 
brings metaspace down, I don’t think performance is back yet.

Even if everything inlines correctly and deeply (which doesn’t happen all the 
time in C2 for long chains), we still have the problem of holding on to these 
synthetic bytecode/metaspace constructs for the LambdaForms. We really don’t 
want to have all this bookkeeping around something that can be as simple as 
permuting a couple of parameters (yes, it can be more complex, same argument 
applies)

LambdaForms were most likely introduced as a platform independent way of 
implementing methodhandle combinators in 8, because the 7 native implementation 
was not very stable, but it was probably a mistake to add them as “real” 
classes instead of code snippets that can just be spliced in around the 
callsite. (I completely lack history here, so flame me if I am wrong)

For both JRuby and Nashorn in the indy world, starting up a process generates 
bytecode where say every 5th to 10th instruction is an indy. Lambda code is not 
that bad, but it can also look pretty hairy. Now, if runtime linkage for each 
of these callsites requires metaspace, hidden bytecode generation, anonymous 
internal classes and the rest of the combinatorial explosion Charlie describes, 
we are setting ourselves up for really bad scalability in such an arena. And metaspace 
of course, goes through the roof. Custom runtime linkage is still slow, but at 
least it only happens once. We don’t want to keep adding even more overhead to 
that.

For 9, it seems that we need a way to implement an indy that doesn’t result in 
class generation and installation of anonymous runtime classes. Note that 
_class installation_ as such is also a large overhead in the JVM - even more so 
when we regenerate more code to get more optimal types. I think we need to move 
from separate classes to inlined code, or something that requires minimum 
bookkeeping. I think this may be subject to profile pollution as well, but I 
haven’t been able to get my head around the ramifications yet.

There are various problems here as well (for example, several of the 
java.lang.invoke combinators create boxing and arrays and do arraycopies), 
stuff that would need to be optimized away, or it’ll punish any indy call. In 
such an environment we can cheat with annotations like 
@ExplodeThisArrayToLocals or @NoSafePoint or similar magic annotations, because 
after all we own the code we splice in. (Solving local escape analysis, which 
is really the problem in the generic form of callsite IR, has so far not been 
very successful in C2), but even if C2 is a little bit legacy, making things 
like this hard, we might still be able to cheat for the limited world/range 
that is an indy callsite and teach C2 some magic. Lots of early performance 
problems Attila and I had in Nashorn were from e.g. 
MethodHandles.catchException, that had to be rewritten to avoid boxing, but 
I’m talking about a more generic mechanism than this.

Having said this, I don’t think that we can solve the indy scalability problems 
in the current JITs, without getting away from the class generation/bytecode 
spewing that results from an indy callsite being compiled. Caching lambda forms 
brings the memory footprint down, but I am already quite worried that it will 
get nowhere near the performance that is needed, due to profile pollution.

If 9 is a platform that supports indy and runs c1 and c2, invokedynamic 
callsites in the JVM, at least in C2, would need some serious love - perhaps as 
described above.

Paul, Vladimir, Rickard - do you have any comments? We had a good discussion a 
couple of weeks ago about profiling callsites in SCA and what to do with such 
callsites. I’d prefer it if one of you guys write down a bit of our thoughts 
from that session, as I am again afraid of making a damn fool of myself among 
genius engineers on this list.  Also cc:ing Fredrik.

/M

On 25 Aug 2014, at 10:07, Jochen Theodorou <blackd...@gmx.org> wrote:

Am 24.08.2014 20:33, schrieb Charles Oliver Nutter:

On Sun, Aug 24, 2014 at 12:55 PM, Jochen Theodorou <blackd...@gmx.org> wrote:

afaik you can set how many times a lambda form has to be executed before it
is compiled... what happens if you set that very low... like 1 and disable
tiered compilation?


Forcing all handles to compile early has the same negative
effect...most are only called once, and the overhead of reifying them
outweighs the cost of interpreting them.

I need to play with it more, though. The property I think you're
referring to did not appear to help us much.


I see it as a tradeoff. Yes, one-time-visited callsites may run even slower 
with this, but I think that is to be measured first. And secondly, you will be 
up to speed much faster than before, which can maybe outweigh the initial 
cost. I am not saying 1 is an ideal value, but it should be played with.

We obviously still love working with OpenJDK, and it remains the best
platform for building JRuby (and other languages). However, our
failure as a community to address these startup/warmup issues is
eventually going to kill us. Startup time remains the #1 complaint
about JRuby, and warmup time may be a close second.


how do normal ruby startup times compare to JRuby for a rails app?


Perhaps 10x faster startup across the board in C Ruby. With tier 1 we
can get it down to 5x or so. It's incredibly frustrating for our
users.


I guess for a rails app that is indeed pretty bad.

All in all, the situation is for the Groovy world quite different I would
say.


I'd guess that developers in the Groovy world typically do all their
development in an IDE, which can keep a runtime environment available
all the time. Contrast this to pretty much everyone not from a Java or
C# background, where their IDE is a text editor and a command line.


Now I feel almost insulted ;) I get scolded so often, that I treat my IDE only 
as a better text editor... I agree in general though.
I think this is not so much a Groovy thing, as more a Java thing though. If you 
do Grails, you do Spring+Apache most of the time. So you don't start a new 
server, you deploy to it. And even that may (in development mode) work by just 
keeping the class files in a certain directory. Unit testing is maybe 
different. But even there, you don't start a new JVM for each test. Maybe not 
even for each test suite. Groovy generally goes with the JVM instance here. 
Actually it is not even easily possible to spawn separate Groovy environments 
in the same JVM. In Grails a new environment might be spawned on a per-suite 
basis.

So yes, there are instances kept around, but imho this is already done from the 
Java world. We do nothing special here most of the time. But of course this is 
related to slow startup speeds of the JVM. groovy-core has around 7k tests, if 
for each of them we would have to create a new JVM it would easily take over an 
hour to execute. With Groovy startup included probably more than 6 hours.

Yes, this is a result of the great startup problem. But, the Java community 
finds ways around. The problem is that in JRuby you have to try to force a Ruby 
mechanism onto the JVM. And this works properly only if the JVM can behave as 
much like Ruby as needed. And with regard to startup times it surely does 
not.

bye Jochen

--
Jochen "blackdrag" Theodorou - Groovy Project Tech Lead
blog: http://blackdragsview.blogspot.com/
german groovy discussion newsgroup: de.comp.lang.misc
For Groovy programming sources visit http://groovy-lang.org

_______________________________________________
mlvm-dev mailing list
mlvm-dev@openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/mlvm-dev
