Hi Egor,

Egor Pasko wrote:
On the 0x29B day of Apache Harmony Alex Astapchuk wrote:
Hi Egor,

Thank you for your input. Please find the answers inlined.

Egor Pasko wrote:
On the 0x29B day of Apache Harmony Alex Astapchuk wrote:
Hi Egor,

Thank you for your valuable comments. Please find the answers inlined.

Egor Pasko wrote:
JIT guys,
...on the HARMONY-3243 patch I would like an open discussion in dev@
rather than a limited one in JIRA.
first of all, the latest faf_em64t.patch crashes for me with the
message:
java:
/home/xxx/svn/1/trunk/working_vm/vm/jitrino/src/shared/LoopTree.h:85:
bool Jitrino::LoopTree::hasLoops() const: Assertion `isValid()'
failed.
SIGABRT in VM code.
Stack trace:
addr2line: '[vdso]': No such file
(hangs here)
Could you please add the steps to reproduce into the JIRA?
done (maybe, my version is overly old, I'll check)
Thanks a lot!

next, with all due respect to Nikolay Sidelnikov for his efforts, I
think this idea needs a more careful analysis and probably
reimplementation.
The story is: two new optimization passes are introduced, one in the High-level
Optimizer and one in the Code Generator:
pass.1 recognizes a simple loop initialization with a simple pattern and
       substitutes it with a JIT helper call
pass.2 takes the corresponding call and substitutes it with the
       Low-Level IR it considers the best code
this should hypothetically improve one simple code pattern (that is
probably widely used?), i.e.: for (int i = 0; i < I; i++) { A[i] = X; }
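To spell the pattern out as a purely illustrative Java method (the names
below are mine, not from the patch):

  // The shape of loop pass.1 is supposed to recognize: a simple fill of an
  // array with a loop-invariant value. Illustrative only - the real matcher
  // works on Jitrino HIR, not on source code.
  static void fillPattern(char[] a, char x) {
      for (int i = 0; i < a.length; i++) {
          a[i] = x;   // candidate for replacement with the JIT 'fill' helper call
      }
  }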
What I figured out looking at the patch:
* [pass.2] does not seem to throw any ArrayIndexOutOfBoundsException
It moves bounds checks outside of the loop and takes the 'fast' loop
only when the check guarantees that no out-of-bounds access happens.
Otherwise, it falls back to the 'regular' version.
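Roughly, at the Java source level the transformation looks like this (a
sketch only; the real pass works on HIR, and fastFill() below is just a
stand-in for the JIT helper):

  static void fillVersioned(char[] a, int from, int to, char x) {
      if (0 <= from && from <= to && to <= a.length) {
          // 'fast' version: the hoisted check above guarantees that no
          // out-of-bounds access can happen inside the loop
          fastFill(a, from, to, x);
      } else {
          // 'regular' version: the original loop, still bounds-checked,
          // still throwing ArrayIndexOutOfBoundsException as before
          for (int i = from; i < to; i++) {
              a[i] = x;
          }
      }
  }

  // stand-in for the helper; the codegen is free to emit wide stores here
  static void fastFill(char[] a, int from, int to, char x) {
      for (int i = from; i < to; i++) {
          a[i] = x;
      }
  }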
OK, but I still cannot find the generation of code for the regular
version in CG. Maybe I am overlooking something. But, anyway,
Sure.
You can't find it in CG. It's in HLO.

oops, of course :) thanks. If the patch had worked, I would have figured it out.

Btw, in the stack trace you provided, the revision number is 515555, which
is pretty old.
As far as I know, it's not reproducible on recent builds. Anyway, thanks
for pointing out the issue.


[If I'm not mistaken] from what I see the improved variant gets added
after the bounds checks. Again - in the HLO pass, not in CG.

yep

additional tests with various bounds-checking conditions are
absolutely necessary for this patch.

* [pass.2] does not have any useful tuning parameters such as number
           of unrolls per loop, thus the scheme eats potential benefit
           from loop unrolling and does not give any tuning back
The optimization is mostly orthogonal to loop unrolling - see
also below.
by saying "orthogonal" you mean, loop unrolling together with this
arrayFilling can give more on a certain code sample than each of the
No.
What I meant is a few lines below in the same message.

It's about more efficient instructions and BBP overkill.

optimizations alone? It is not true. These do not complement each
other as far as I can see. But in the case of abcd+versioning+unrolling+simple_fill
they could.
for simple_fill idea see below

* [pass.1] detects such a rare pattern that I doubt it would benefit a
           user (but obviously will benefit a you-know-which-benchmark
           runner)
It's quite common for many string-intensive applications, including
XML-oriented ones.
I would suppose it's even more common than the 'static array initialization'
pattern that is explicitly handled in the translator:
you may have a look into
jitrino/src/translator/java/JavaByteCodeTranslator.cpp,
methods JavaByteCodeTranslator::newarray(), checkForArrayInitializer().
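For comparison, the 'static array initialization' pattern is the plain
source-level initializer, which javac lowers into a newarray plus one store
per element - that bytecode shape is what checkForArrayInitializer() looks
for (my rough description of its intent):

  static int[] staticInitPattern() {
      // Source form of the pattern:
      int[] table = {1, 2, 3, 4};
      // javac compiles the initializer into roughly:
      //   newarray int
      //   dup; iconst_0; iconst_1; iastore
      //   dup; iconst_1; iconst_2; iastore
      //   ... one such sequence per element
      return table;
  }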
maybe..
to prove that, do you have a handy popular open source benchmark showing
this extreme space filling in XML-intensive computations? Are there
many XML documents with 1000 spaces in a row that make sense?
Please, see below.

* [pass.1] has a lot of new code that introduces potential instability
Could you please clarify what you mean by instability, exactly?
I mean, it needs more testing. More corner cases with AIOOBE at
least.

           (if the pattern is not detected properly, the code does
           not read easily), but does not contain a single unit test
           or the like. Together with the AIOOBE issue, stability becomes a
           real question.
* back branch polling is not performed (which is probably good for
  performance, but I would rather have a tuning option)
BBP is useless (and even harmful) on short finite loops for various
reasons - it adds a memory access (a flag check), a branch, and a live
range. For hot short loops it's overkill.
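To make the cost concrete, here is what BBP effectively adds at the source
level (a sketch; the real check is inserted by the JIT on the back branch,
and the flag/helper names below are made up):

  static void fillWithPolling(char[] a, char x) {
      for (int i = 0; i < a.length; i++) {
          a[i] = x;
          // BBP inserts, on every back branch, roughly:
          //   if (suspendRequested) pollForSafepoint();
          // i.e. an extra load of the flag, a compare-and-branch, and one
          // more value kept live across the whole loop body.
      }
  }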
in tp.java they are not quite short. And XML-intensive applications
'quite short' is 'quite subjective' :-)

Please have a deeper look at the provided micro benchmark.
It has nothing to do with 1000 spaces in a row.

It works with *80-char* strings. I personally find that a good
average. If we find it productive, we may run the same tests for
various lengths, but I bet it will work for all the real strings.

trigger GC often, and GC needs this BBP to suspend faster. And it is
not an obvious thing. I would prefer complementary optimizations that
Huh? Did I read it correctly?
Do you *really* mean that such short loops may affect a thread
suspension??

I don't believe this is true.

Such short loops take just dozens of clock ticks to finish, while
thread suspension involves a much heavier code sequence, including spin
loops, waiting on events, and context switches. They are just not
comparable in time.
Please correct me if my estimates are wrong.

Alex, let's then put a limit of 80 chars (or
MAX_ALLOWED_CHARS_FOR_ARRAY_FILL) on this optimization. There is
no such limit currently.

Sorry, did not get it. For what reason?

do not kill each other. And tuning parameters for a genetic algorithm
to choose non-obvious heuristics.

Throwing away the BBP is one of the goals for the optimization...
loop finiteness is easily detected in this case. And I would prefer
more general algorithms that optimize a variety of situations. This
specific optimization detects only one specific case, has a lot of code,
and is not properly tested. Are we still targeting stability?
Hmmmm.... Could you please be more specific? 'A lot of code' is kind of
an emotional characterization. :-)

What makes you think it was not tested properly?

no tests in JIRA

There were many other tests performed during the feature preparation.
These include a 'build test', mini- and other benchmarks, and other apps.

Also, as you see, your quick-look presumption about a missed out-of-bounds
check was also incorrect - all the existing machinery is still in place.
But if you have any suggestions on better testing in general, they are
always welcome.


Bugs happen sometimes, it's sad. :-(

I probably know :)

It was really helpful that you discovered the problem before it went into
the main trunk.

Alex, sorry if I was too emotional. My bad. I think this is not a good
place for emotions like these. Let's stick to more
engineering subjects.

What I can say more is that a good "ABCD" optimization complemented
... and ABCD does nothing about BBP, does it?

with "loop versioning" optimiztion will make a more readable, more
stable code, AND will give a better performance gain (loop unrolling
Unfortunately no, it will not. :-(

As you probably know ;-), char is 16 bits wide in Java.
I know, probably :)

The code generated for char moves has to use 16-bit move
instructions. These instructions carry an operand-size change prefix
(66h) that makes the CPU decoder feel bad.
yikes
Is this on super-duper Intel Core Whatever machines with such badly
feeling decoders, or is it about those 486 processors still living somewhere?

It's true for many modern processors. [I suppose] they have to
support such instructions for the tons of existing legacy code,
while providing much better alternatives for modern compilers like
ICL or Jitrino.
If you're really interested in the details, please refer to
the "Intel(R) 64 and IA-32 Architectures Optimization Reference Manual".
The link is at [1]; you may be interested in #3.4.2.3,
"Length-Changing Prefixes (LCP)".

1. interesting reading. Pipeline stalls due to instruction
   decoding. (I will try to be less emotional than I am now) In my
   humble opinion we are writing to memory here. That is the place to
   stall. Instruction decoding is generally something like 10 times faster
   (if not 100).

Ah, these humble opinions... They are so human. :-)
Modern out-of-order processors like Intel's Core 2 Duo are really hard
to predict with only common sense and a single topic from the manual.
Science begins where measurement begins.
VTune may become your best friend here.

2. did you measure the actual difference?

YES!
Reading the JIRA shows a few-times speedup on the micro benchmark.
As is common for micro benchmarks, YMMV on real apps depending on
the characteristics of the workload.

3. #3.4.2.4 says that decoded insts are used for small loops. Is the
   optimized sequence that large? More than 18 instructions? Well, I
   did not count; maybe, maybe. But that is for number-crunching loops, not
   memory accesses.

I was told that optimizations work whether you believe in them or not. :-)

Any unrolling or versioning
would leave these heavy instructions in place.
If so, write a 5-line optimization pass that replaces the operation
encoding in the (unrolled!) loops. In this case the optimization will
be a) more general/useful/complementary, b) more readable.
It's an interesting idea. It would be really interesting to see such a
pass implemented and also to measure it against HARMONY-3243.

I would love to implement versioning. Maybe not so fast, though..

Hmmm... I somehow recall someone saying 'piece of cake'?
For an HLO-oriented guy like you? ;-)

Anyway, please don't take my joke the wrong way - feel free to submit it
whenever it's convenient for you.



One of the goals was to throw the heavy instructions away and replace
them with more efficient and faster ones. It's rather hard to do in the
codegen alone, but much easier (and cleaner) with a little help from
the HLO side - this is the rationale for the HLO pass.
Okay, now I begin to understand the whole motivation. Thanks! It could
be in JIRA, btw.
And now you have to support the helper in all code generators. I am
not
Not really. As you probably know ;-) the optimization selection scheme
in Jitrino is extremely flexible. If a codegen does not need a
particular optimization, it can be turned off as easily as typing a single
'-' in .emconf.

Ah, of course. I am somewhat stupid today, and probably not just today :)

so happy. My scheme is simpler and more portable.
... but it does not exist in the code ;-) [yet?] ...

thanks for pointing this out :) I probably know.

My pleasure. That's what friends are for.


is awake too). Setting aside the fact that the overall design will be
more straightforward (having no interdependent passes, extra helpers, etc.)
So I vote for focusing on ABCD plus "loop versioning" and leaving
specific benchmark-oriented tricks (complicating our design) alone.
Again, the optimization is orthogonal to ABCD and was never
positioned as a replacement for ABCD. The optimization's main targets are
string- (and XML-) intensive apps.
thanks for the explanation

An experienced hacker would say that all compiler research is a bunch
of hacks that influence each other in unexpected ways. Well, maybe,
but I do not like this particular hack (with all respect to Nikolay
for his efforts)
In no way is this a hack. If you see room for better improvements here,
patches are always welcome. :-)
I would prefer if we spent time on more usable approaches such as loop
versioning, which take almost the same time to implement, are more
portable, more readable, and cover a wider variety of optimization
opportunities.
... Seriously, I don't see any reason to pit one optimization
against another.

What if it makes the design more complicated? What if more bugs happen?
I would not say these words in general (for a production-quality piece of code).

Did you say 'design'? How come a single simple optimization pass
affects the whole of Jitrino's design? You must be kidding...

Could you please be more specific about what you are talking about?
All these 'more complicated', etc. are too loose. If you have any
particular suggestion, it would be nice to discuss it, rather than talking
in abstract general wordings. Or it would be even better to see a
patch showing what you think needs an update and why.

And sorry - I did not get the point about the bugs.
Bugs (if any) can be easily fixed - so what's the problem?


Surely, it would be just wonderful to have all optimizations at once.

I really mean 'all' - 'every single possible optimization'.
It would be really great.

yep

Though software development (as you probably know ;-) ) is an
iterative and evolutionary process.

there are evolutions and evolutions

I am not trying to convince anyone to stop evolution, no. I wanted to
emphasize that we can look back and see that another approach would
(probably) have been better and would have been almost as hard

In this 'probably better' the key word is 'probably', and 'better' is far
from proven. :-)
Luckily, we still have a chance to check this 'probably' against the
highest standards, and even compare it with the existing things.


to implement. But unfortunately we had no discussion on that on the
list. Just throwing patches..

As long as your (or anyone else's) proposals get implemented - if
they make any of the existing optimizations redundant - then what's the
problem? We can easily switch to the better alternative.
The more alternatives we have, the better.

+1

I'm glad to see you are not proposing to throw away the *existing* and
*working* code in favour of surely interesting, but not yet implemented,
ideas.

I also suggest committing the existing HARMONY-3243.

now it's implemented, I see no barriers to accepting it (some tests,
please). I needed the discussion. It was productive. Thanks.

We are all working on the next steps forward.
When (in fact, if) we get a better alternative, we may choose
whichever variant we find useful.

And thanks again for your valuable input.

and please do not be as emotional as I am :)

Here is the optimization guide I talked about before:

[1] http://www.intel.com/design/processor/manuals/248966.pdf

--
Yours faithfully,
      Alex


This "improvement" looks like non-fair covering the gap in
you-know-which benchmark with a quick hack and without testing.
Well, now it is implemented and already gives some benefit, I can
agree to let it go. But let's implement versioning, improve unrolling
and throw this 'arrayFiiling' away.

is it your permanent signature? how about a copyright? :)

It's all yours.
And it's not a part of my signature - I could not even try to steal
these words of wisdom from you.


--
Yours faithfully,
     Alex
