Aleksey Shipilev wrote:
Hi again, Tim.
So I spent another day on this issue. I've gathered a profile of
SPECjbb2005 and grepped out the HashMap methods (okay, I had to disable
inlining, so the exact numbers differ from an actual performance run):
Thanks for spending time to look into the issue Aleksey, it is much
appreciated.
Current implementation:
6.99% HashMap.findNonNullKeyEntry(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.61% HashMap.getEntry(Ljava/lang/Object;)Ljava/util/HashMap$Entry;
0.25% HashMap.get(Ljava/lang/Object;)Ljava/lang/Object;
---------------
7.86% Total
H5374:
6.01% HashMap.findNonNullKeyEntryInteger(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.67% HashMap.findNonNullKeyEntryLegacy(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
0.61% HashMap.getEntry(Ljava/lang/Object;)Ljava/util/HashMap$Entry;
0.42% HashMap.get(Ljava/lang/Object;)Ljava/lang/Object;
0.39% HashMap.findNonNullKeyEntry(Ljava/lang/Object;II)Ljava/util/HashMap$Entry;
---------------
7.05% Total
Percentages are clock-tick percentages of the entire workload.
So, the profile shows that the H5374 code is actually faster.
Then, after a talk with Sergey Kuksenko (that's a credit to him :)), I
tried to compare these two implementations without allocPrefetch,
which prefetches the memory for newly created objects and thus
incurs high cache pressure. allocPrefetch itself gives hu-u-uge
boosts, but can expose cache limitations for other optimizations. So,
with allocPrefetch disabled:
Windows x86
100.0% Harmony-clean
101.1% Harmony + H5374
Windows x86_64
100.0% Harmony-clean
100.5% Harmony + H5374
That's the boost I'm looking for! I wonder why such a positive change
as manual unboxing changes L2 cache access patterns so that it gives a
boost in normal mode and a degradation in the presence of a heavy L2
cache user.
I have also remeasured all modes accurately, so let's draw a
conclusion on this issue:
Windows x86:
100.0% [base] Harmony-clean
100.2% [+0.2%] Harmony-clean + H5374
88.6% [base] Harmony-clean - allocPrefetch
89.6% [+1%] Harmony-clean - allocPrefetch + H5374
Windows x86_64:
100.0% [base] Harmony-clean
100.1% [+0.1%] Harmony-clean + H5374
88.9% [base] Harmony-clean - allocPrefetch
89.3% [+0.5%] Harmony-clean - allocPrefetch + H5374
...measurement uncertainty is about 0.4%.
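As a rough cross-check of the figures above, the relative gain of each
patched run over its own baseline can be compared with the quoted
uncertainty. This is only a sketch: the class and method names are
made up for illustration, and it assumes the 0.4% uncertainty applies
to the relative difference between two runs, which the original
measurements do not state.

```java
/** Illustrative arithmetic only; names here are hypothetical. */
public class GainCheck {
    /** Relative gain of a patched score over its baseline, in percent. */
    static double relativeGainPercent(double base, double patched) {
        return (patched - base) / base * 100.0;
    }

    /** True when the gain is larger than the assumed measurement noise. */
    static boolean exceedsNoise(double base, double patched, double noisePercent) {
        return Math.abs(relativeGainPercent(base, patched)) > noisePercent;
    }

    public static void main(String[] args) {
        double noise = 0.4; // assumed to be a relative uncertainty, in percent
        double[][] runs = {
            {100.0, 100.2}, // x86, allocPrefetch on
            {88.6, 89.6},   // x86, allocPrefetch off
            {100.0, 100.1}, // x86_64, allocPrefetch on
            {88.9, 89.3},   // x86_64, allocPrefetch off
        };
        for (double[] r : runs) {
            System.out.printf("%.1f -> %.1f: gain %.2f%%, above noise: %b%n",
                    r[0], r[1], relativeGainPercent(r[0], r[1]),
                    exceedsNoise(r[0], r[1], noise));
        }
    }
}
```

Under that assumption only the runs without allocPrefetch clear the
noise level, and the x86_64 one only barely.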
Based on this data, I would say this patch can't get much of a boost
on DRLVM, since DRLVM's optimizations do their job of scalarization
just fine. The patch should also increase cache locality, and that
seems to be the case in the absence of another L2 cache contributor.
Let's add that such specialization bloats the code a little, and jump
to the conclusion that from the DRLVM side it would be better to keep
the patch out of trunk.
Fair enough (though it looks like a minor improvement, right?).
I'm happy to leave the patch out.
Can I go back a moment to hear about the scalar replacement technique in
Jitrino? Feel free to point me to some doc or code if that is easier.
As you know, my goal was to avoid the key dereferencing when searching
the hashmap by, as you say, unboxing the Integer and encoding the value
in the hashcode int field. The key field is still an object ptr to the
original Integer object which is required for answering the keySet etc.
So how does Jitrino both unbox the primitive and preserve the 'box' for
when it must be returned? [If you see what I mean, otherwise I'll try
and rephrase it]
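To make the idea in the paragraph above concrete, here is a minimal
sketch of a chained hash table with a separate lookup path for Integer
keys. It is not the Harmony HashMap or the H5374 patch itself: the
class name, the fixed-size table, the intKey flag and the method names
are all assumptions made for illustration, and null keys are not
handled. It relies only on the fact that Integer.hashCode() returns
the int value, so comparing stored hashes is enough to compare Integer
keys, while the key field still holds the original box.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the Harmony HashMap or the H5374 patch. */
public class IntKeySketch {
    /** Chain entry; intKey is a hypothetical flag added for this sketch. */
    static final class Entry {
        final Object key;     // original box, kept so key queries can return it
        final int hash;       // for Integer keys this equals the unboxed value
        final boolean intKey; // true when key is an Integer
        Object value;
        Entry next;

        Entry(Object key, int hash, Object value, Entry next) {
            this.key = key;
            this.hash = hash;
            this.intKey = key instanceof Integer;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed-size table (a power of two) to keep the sketch short; no resizing.
    private final Entry[] table = new Entry[16];

    private int indexFor(int hash) {
        return hash & (table.length - 1);
    }

    /** Integer path: compares ints only and never dereferences e.key. */
    private Entry findInteger(int hash) {
        for (Entry e = table[indexFor(hash)]; e != null; e = e.next) {
            if (e.intKey && e.hash == hash) {
                return e;
            }
        }
        return null;
    }

    /** Generic path: hash check first, then equals() on the stored key. */
    private Entry findLegacy(Object key, int hash) {
        for (Entry e = table[indexFor(hash)]; e != null; e = e.next) {
            if (e.hash == hash && key.equals(e.key)) {
                return e;
            }
        }
        return null;
    }

    private Entry find(Object key) {
        int hash = key.hashCode(); // Integer.hashCode() returns the int value
        return (key instanceof Integer) ? findInteger(hash) : findLegacy(key, hash);
    }

    public Object put(Object key, Object value) {
        Entry e = find(key);
        if (e != null) {
            Object old = e.value;
            e.value = value;
            return old;
        }
        int hash = key.hashCode();
        int index = indexFor(hash);
        table[index] = new Entry(key, hash, value, table[index]);
        return null;
    }

    public Object get(Object key) {
        Entry e = find(key);
        return e == null ? null : e.value;
    }

    /** Returns the original key objects, boxes included. */
    public List<Object> keys() {
        List<Object> out = new ArrayList<>();
        for (Entry head : table) {
            for (Entry e = head; e != null; e = e.next) {
                out.add(e.key);
            }
        }
        return out;
    }
}
```

The flag is there because a non-Integer key can share a hash with an
Integer (for example "a".hashCode() is 97); without some marker in the
entry, the Integer path would have to touch the key object to tell the
two apart.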
There is one more possible opportunity: tuning the prefetch distance
in allocPrefetch, but that's a fragile thing to optimize.
Yeah, but no need to perform unnatural acts. We can leave it out if
there is no benefit to Harmony.
Regards,
Tim