On the 0x29B day of Apache Harmony Alex Astapchuk wrote:
> Hi Egor,
>
> Thank you for your valuable comments. Please, find the answers inlined.
>
> Egor Pasko wrote:
> > JIT guys,
> > ...on the HARMONY-3243 patch I would like an open discussion in dev@
> > rather than a limited one in JIRA.
> > first of all, the latest faf_em64t.patch crashes for me with the
> > message:
> > java:
> > /home/xxx/svn/1/trunk/working_vm/vm/jitrino/src/shared/LoopTree.h:85:
> > bool Jitrino::LoopTree::hasLoops() const: Assertion `isValid()'
> > failed.
> > SIGABRT in VM code.
> > Stack trace:
> > addr2line: '[vdso]': No such file
> > (hangs here)
>
> Could you please add the steps to reproduce into the JIRA?
done (maybe, my version is overly old, I'll check)
> > next, with all great respect to Nikolay Sidelnikov for his efforts I
> > think, this idea needs a more careful analysis and probably
> > reimplementation.
> > The story is: 2 new optimization passes introduced: one in High-level
> > Optimizer nd one in Code Generator:
> > pass.1 recognizes a simple loop initialization with a simple pattern,
> > substitutes it with a JIT helper call
> > pass.2 takes the corresponding call and substitutes it with a
> > Low-Level-IR as it thinks the best code is
> > this should hypothetically improve one simple code pattern (that is
> > probably widely used?), i.e.: for (int i=0;i<I;i=++){A[i]=X;}
> > What I figured out looking at the patch:
> > * [pass.2] does not seem to throw any AIOutOfBoundsException
>
> It moves bound checks outside of the loop and does the 'fast' loop
> only when the check guarantees that no outofbounds happens. Otherwise,
> it falls into a 'regular' version.
OK, but I still cannot not find the generation of code for the regular
version in CG. Maybe, I a overlooking something. But, anyway,
supplemented tests with various conditions of bounds checking are
absolutely necessary for this patch.
> > * [pass.2] does not have any useful tuning parameters such as number
> > of unrolls per loop, thus the scheme eats potential benefit
> > from loop unrolling and does not give any tuning back
>
> The optimization is mostly orthogonal to the loop unrolling - see
> also below.
by saying "orthogonal" you mean, loop unrolling together with this
arrayFilling can give more on a certain code sample than each of the
optimizations alone? it is not true. these do not complement each
other as far as I can see. But in case of abcd+versioning+unrolling+simple_fill
they could.
for simple_fill idea see below
> > * [pass.1] detects such a rare pattern that I doubt it would benefit a
> > user (but obviously will benefit a you-know-which-benchmark
> > runner)
>
> It quite common for many string intensive applications, including
> XML-oriented.
> I would suppose it's even more common than 'static array initialization'
> pattern that is explicitly handled in the translator:
> you may have a look into
> jitrino/src/translator/java/JavaByteCodeTranslator.cpp,
> methods JavaByteCodeTranslator::newarray(), checkForArrayInitializer().
maybe..
to prove that, do you have a handy popular open source benchmark to
show this extreme space filling in XML-intensive comptations? are
there many XML documents of 1000 spaces in a row that make sense?
> > * [pass.1] has a lot of new code that introduces potential instability
>
> Could you please clarify more about instability - what do you mean exactly?
I mean, it needs more testing. More corner cases with AIOOBE at least.
> > (if the pattern was detected not properly, the code does
> > not read easily), but does not contain a single unit test
> > or the like. Together with AIOOBE issue stability becomes a
> > real question.
> > * back branch polling is not performed (which is probably good for
> > performance, but I would better have a tuning option)
>
> BBP is useless (and even harmful) on short finite loops for various
> reasons - it adds a memory access (a flag check), a branch, and a live
> range. For hot short loops it's overkill.
in tp.java they are not quite short. And XML-intensive applications
trigger GC often, and GC needs this BBP to suspend faster. And, it is
not an obvios thing. I would prefer complementary optimizations that
do not kill each other. And tuning parameters for geneting algorithm
to choose non-obvious heuristics.
> Throwing away the BBP is one of the goals for the optimization...
loop finiteness is easily detected in this case. And I would prefer
more general algorithms that optimize a variety of situations. This
specific optimiztion detects only one specific case, has a lot of code
and not properly tested.
Are we still targeting on stability?
> > What I can say more is that a good "ABCD" optimization complimented
> ... and ABCD does nothing with BBP, isn't it?
>
> > with "loop versioning" optimiztion will make a more readable, more
> > stable code, AND will give a better performance gain (loop unrolling
> Unfortunately no, it will not. :-(
>
> As you probably know ;-), char is 16 bits wide in Java.
I know, probably :)
> The code generated for char moves have to use 16 bit movement
> instructions. These instructions include operand-size change prefix
> (66h) that makes CPU decoder feel bad.
yukes
Is this on super-duper Intel Core Whatever machines with decoders so
badly feeling or it is about those somewhere living 486 processors?
> Whatever unrolling or versioning
> would leave these heavy instructions on place.
If so, write a 5 line optimization pass that replaces the operation
encoding in the (unrolled!) loops. In this case the optimization will
be a) more general/useful/complementing-each-other b) more readable.
> One of the goals was to throw the heavy instructions away and replace
> them with more effective and fast ones. It's somehow hard to do in
> codegen separately, but much easy (and clearly) with a little help from
> the HLO side - this is rationale for the HLO pass.
Okay, now I begin understanding the whole motivation. Thanks! Could be
in JIRA, btw.
And now you have to support the helper in all codegenerators. I am not
so happy. My scheme is more simple and portable.
> > is awake too). Setting aside the fact that the overall design will be
> > more straightforward (having no interdependent passes, extra helpers, etc)
> > So I vote for focusing on ABCD plus "loop versioning" and leaving
> > specific benchmark-oriented tricks (complicating our design) alone.
>
> Again, the optimization is orthogonal to the ABCD and was never
> positioned as replacement for ABCD. Optimization's main target are
> string (and XML) intensive apps.
thanks for explanation
> > An experienced hacker would say that all compiler reasearch is a bunch
> > of hacks that influence each other in unexpected ways. Well, maybe,
> > but I do not like this particular hack (with all respect to Nikolay
> > for his efforts)
>
> In no way this is a hack. If you see a better room for the
> improvements here, patches are always welcome. :-)
I would prefer if we spent time on more usable approaches as loop
versioning that take almost the same time to implement, are more
portable, more readable, and solve a wider variety of optimization
opportunities.
This "improvement" looks like non-fair covering the gap in
you-know-which benchmark with a quick hack and without testing.
Well, now it is implemented and already gives some benefit, I can
agree to let it go. But let's implement versioning, improve unrolling
and throw this 'arrayFiiling' away.
--
Egor Pasko