On 09/25/2014 06:00 PM, Vitaly Davidovich wrote:
Note that the compiler does not use multiplication here. This is strength-reduced
into a bit shift followed by a subtraction (i.e. h * 31 = h * 32 - h, where
h * 32 is a `sal 5`).
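The identity behind that strength reduction is easy to check in plain Java (a sketch of the arithmetic identity, not the JIT's actual output):

```java
public class StrengthReduction {
    public static void main(String[] args) {
        // h * 31 == h * 32 - h == (h << 5) - h for any int h, including
        // values that overflow, because int arithmetic wraps mod 2^32.
        int[] samples = { 0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE };
        for (int h : samples) {
            int viaMul   = h * 31;
            int viaShift = (h << 5) - h;  // shift (sal 5) followed by a subtraction
            if (viaMul != viaShift)
                throw new AssertionError("mismatch for h = " + h);
        }
        System.out.println("h * 31 == (h << 5) - h for all samples");
    }
}
```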
I've noticed that in the disassembly files provided by Aleksey. Good to
know, so one need not bother with bit shifts in situations where
multiplication / division looks clearer in source code.
The point about allowing execution units to run in parallel (given
the break in data dependency) is very valid though.
For kicks and giggles, I looked at what gcc, clang, and icc generate
for similar C code at O2 and O3, and did not observe them making any
transformations to break the data dependency.
So HotSpot can break the ice here ;-)
While it would be cool if the compiler could do this transformation on its
own, as you say Peter, the scope where this transform is applicable is
somewhat narrow(?). If speeding up the non-cached String.hashCode() or
Arrays.hashCode() methods is desired, perhaps intrinsics could be
implemented (the utility for String.hashCode seems low, though). In that
case, one could do this transformation manually and also vectorize it for
long inputs.
I see there are multiple options - each with its own benefits and
drawbacks. Hard to choose.
Regards, Peter
On Thu, Sep 25, 2014 at 7:09 AM, Peter Levart <peter.lev...@gmail.com> wrote:
On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:
Hi Peter,
On 09/25/2014 02:46 AM, Peter Levart wrote:
http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench.java
Interesting.
I have to say it once again:
a) It is *an error* to use static finals as the benchmark input. They
are perfectly constant-foldable in way too many cases. Break this habit,
please.
Hi Aleksey,
The "constant" in this example is only the reference to the char[]
array; its content is not constant. In String, this is a final instance
field, which behaves similarly inside an instance method (usually
it is read only once per method invocation).
b) Explicit Blackholes are not needed, and just returning "int" from
the @Benchmark method helps readability a lot. Please break this habit
as well. Having readable and maintainable benchmarks is key.
Ok, here's a modified benchmark:
http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench2.java
It behaves similarly. Here are its results:
* Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8 -gc true
*
* Benchmark                   Mode  Samples          Score  Score error  Units
* j.t.HashBench2._hashCode   thrpt        8    8308858.217   353019.084  ops/s
* j.t.HashBench2.hashCode0   thrpt        8    8207337.729   217048.634  ops/s
* j.t.HashBench2.hashCode1   thrpt        8   13359572.359   345736.675  ops/s
* j.t.HashBench2.hashCode2   thrpt        8   15310621.202   238369.017  ops/s
* j.t.HashBench2.hashCode3   thrpt        8   17637944.829   232155.847  ops/s
* j.t.HashBench2.hashCode3i  thrpt        8   17724181.444   509913.288  ops/s
* j.t.HashBench2.hashCode3x  thrpt        8    8344128.432   159508.813  ops/s
* j.t.HashBench2.hashCode4   thrpt        8   16526850.489   969549.448  ops/s
* j.t.HashBench2.hashCode5   thrpt        8   17567765.554   917934.885  ops/s
* j.t.HashBench2.hashCode6   thrpt        8   17705074.332   419405.652  ops/s
* j.t.HashBench2.hashCode7   thrpt        8   18805633.563   209181.299  ops/s
* j.t.HashBench2.hashCode8   thrpt        8   18300123.201   376681.550  ops/s
It would be interesting to see how it behaves on different CPUs.
This is really great!
Couldn't this be a tweak in HotSpot, instead of uglifying and bloating
the Java and hence the byte code?
+1
This is for the HotSpot compiler guys to answer. Theoretically I think it
is possible. But it would have to be tailored to a very specific use case,
and I don't know whether such a specific transformation would have wide
enough applicability. If it would only speed up String.hashCode and very
similar loops, it is less trouble to do that by hand in one or a few
places...
I would think this happens in user-specified hashCode() over arrays.
IDEs routinely generate a loop like that, or delegate to
Arrays.hashCode, which does the same loop.
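For reference, the loop in question - the one IDE-generated hashCode() methods produce and that java.util.Arrays.hashCode(int[]) specifies - looks like this (the method name arrayHash is mine, for illustration only):

```java
import java.util.Arrays;

public class ArrayHashDemo {
    // The polynomial hash loop shared by IDE-generated hashCode() methods
    // and java.util.Arrays.hashCode(int[]): result = 31 * result + element.
    static int arrayHash(int[] a) {
        if (a == null) return 0;
        int result = 1;
        for (int element : a)
            result = 31 * result + element;
        return result;
    }

    public static void main(String[] args) {
        int[] data = { 1, 2, 3, 42 };
        // Matches the library implementation:
        System.out.println(arrayHash(data) == Arrays.hashCode(data));  // true
    }
}
```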
Arrays.hashCode() can be "improved" this way too.
In other words, I would like to see this fixed on the compiler side.
This seems to be the strength reduction playing tricks with the loop
unrolling; I'll submit a bug shortly.
As I said, I don't think it has anything to do with loop unrolling. The
transformation I applied in hashCode1, 2, 3, ... 8 produces code that
executes 2, 3, 4, ... 9 independent multiplications in each chunk, which
allows the CPU's pipeline to execute them in parallel. I had to manually
unroll the loop a bit just to achieve this transformation, but my manual
unrolling does not bring the speed-up per se; the parallelization of the
multiplications does. This can be seen in the score of the hashCode3x
benchmark, which has the same loop structure as hashCode3 but performs
the multiplications in a way where each depends on the result of the
previous one, which prevents the CPU from parallelizing them.
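A minimal two-way version of that transformation, in the spirit of hashCode1 (method names are mine, not the benchmark's):

```java
public class HashTransform {
    // Baseline: every multiplication depends on the previous value of h.
    static int hashDependent(char[] v) {
        int h = 0;
        for (char c : v)
            h = 31 * h + c;
        return h;
    }

    // Transformed: since 31 * (31 * h + a) + b == 961 * h + 31 * a + b,
    // the multiplications 961 * h and 31 * a have no data dependency on
    // each other, so the CPU can issue them in parallel. Because int
    // arithmetic wraps mod 2^32, the reassociation is exact.
    static int hashTwoWay(char[] v) {
        int h = 0, i = 0;
        for (; i + 1 < v.length; i += 2)
            h = 961 * h + 31 * v[i] + v[i + 1];
        for (; i < v.length; i++)   // leftover element for odd lengths
            h = 31 * h + v[i];
        return h;
    }

    public static void main(String[] args) {
        char[] s = "the quick brown fox".toCharArray();
        System.out.println(hashDependent(s) == hashTwoWay(s));  // true
    }
}
```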
This is not to say that such a transformation couldn't be done on the
JIT side. I just have a feeling that such a transformation won't be
widely applicable because it is very specific. It can only be used
within integer arithmetic of homogeneous width (since it changes the
order of operations, the final result depends on which width is in
effect when overflow happens). Floating-point arithmetic is equally
unsuitable for transformations that change the order of operations.
The transformation can only help when a sequence of operations that
depend on one another is changed into a sequence of independent
operations, and those operations have a cost that matters (such as
multiplication).
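The floating-point caveat is easy to demonstrate: reassociating a sum changes the result, so the same trick would change observable behavior there (values below are mine, chosen to force rounding):

```java
public class FloatReassoc {
    public static void main(String[] args) {
        float big = 1e8f, small = 1.0f;
        // At magnitude 1e8f a float's ulp is 8, so (big + small) rounds
        // small away entirely; the reassociated form keeps it.
        float left  = (big + small) - big;   // 0.0f
        float right = (big - big) + small;   // 1.0f
        System.out.println(left == right);   // false: fp addition is not associative
    }
}
```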
Regards, Peter
-Aleksey.