Re: RFR: 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops [v12]

Ludovic Henry Tue, 10 May 2022 07:47:27 -0700

> Despite the hash value being cached for Strings, computing the hash still 
> represents a significant CPU usage for applications handling lots of text.
> 
> Even though it would be generally better to do it through an enhancement to 
> the autovectorizer, the complexity of doing it by hand is trivial and the 
> gain is sizable (2x speedup) even without the Vector API. The algorithm has 
> been proposed by Richard Startin and Paul Sandoz [1].
> 
> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`
> 
> 
> Benchmark                                        (size)  Mode  Cnt     Score  
>   Error  Units
> StringHashCode.Algorithm.scalarLatin1                 0  avgt   25     2.111 
> ±  0.210  ns/op
> StringHashCode.Algorithm.scalarLatin1                 1  avgt   25     3.500 
> ±  0.127  ns/op
> StringHashCode.Algorithm.scalarLatin1                10  avgt   25     7.001 
> ±  0.099  ns/op
> StringHashCode.Algorithm.scalarLatin1               100  avgt   25    61.285 
> ±  0.444  ns/op
> StringHashCode.Algorithm.scalarLatin1              1000  avgt   25   628.995 
> ±  0.846  ns/op
> StringHashCode.Algorithm.scalarLatin1             10000  avgt   25  6307.990 
> ±  4.071  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16       0  avgt   25     2.358 
> ±  0.092  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16       1  avgt   25     3.631 
> ±  0.159  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16      10  avgt   25     7.049 
> ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16     100  avgt   25    33.626 
> ±  1.218  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16    1000  avgt   25   317.811 
> ±  1.225  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled16   10000  avgt   25  3212.333 
> ± 14.621  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8        0  avgt   25     2.356 
> ±  0.097  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8        1  avgt   25     3.630 
> ±  0.158  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8       10  avgt   25     8.724 
> ±  0.065  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8      100  avgt   25    32.402 
> ±  0.019  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8     1000  avgt   25   321.949 
> ±  0.251  ns/op
> StringHashCode.Algorithm.scalarLatin1Unrolled8    10000  avgt   25  3202.083 
> ±  1.667  ns/op
> StringHashCode.Algorithm.scalarUTF16                  0  avgt   25     2.135 
> ±  0.191  ns/op
> StringHashCode.Algorithm.scalarUTF16                  1  avgt   25     5.202 
> ±  0.362  ns/op
> StringHashCode.Algorithm.scalarUTF16                 10  avgt   25    11.105 
> ±  0.112  ns/op
> StringHashCode.Algorithm.scalarUTF16                100  avgt   25    75.974 
> ±  0.702  ns/op
> StringHashCode.Algorithm.scalarUTF16               1000  avgt   25   716.429 
> ±  3.290  ns/op
> StringHashCode.Algorithm.scalarUTF16              10000  avgt   25  7095.459 
> ± 43.847  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16        0  avgt   25     2.381 
> ±  0.038  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16        1  avgt   25     5.268 
> ±  0.422  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16       10  avgt   25    11.248 
> ±  0.178  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16      100  avgt   25    52.966 
> ±  0.089  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16     1000  avgt   25   450.912 
> ±  1.834  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled16    10000  avgt   25  4403.988 
> ±  2.927  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8         0  avgt   25     2.401 
> ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8         1  avgt   25     5.091 
> ±  0.396  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8        10  avgt   25    12.801 
> ±  0.189  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8       100  avgt   25    52.068 
> ±  0.032  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8      1000  avgt   25   453.270 
> ±  0.340  ns/op
> StringHashCode.Algorithm.scalarUTF16Unrolled8     10000  avgt   25  4433.112 
> ±  2.699  ns/op
> 
> 
> At Datadog, we handle a great amount of text (through logs management for 
> example), and hashing String represents a large part of our CPU usage. It's 
> very unlikely that we are the only one as String.hashCode is such a core 
> feature of the JVM-based languages with its use in HashMap for example. 
> Having even only a 2x speedup would allow us to save thousands of CPU cores 
> per month and improve correspondingly the energy/carbon impact.
> 
> [1] 
> https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf


Ludovic Henry has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains 18 commits:

 - Fix overlapping registers
 - Actually fix h when hashcode is vectorized
 - Merge branch 'master' of https://github.com/openjdk/jdk into 
vectorized-stringlatin1-hashcode
 - Fix h when vectorized for Arrays.hashCode
 - Add missing check for AryHashCode node
 - Fix some merge conflicts
 - Disable Arrays.hashCode intrinsic by default for CI
 - Merge branch 'master' of https://github.com/openjdk/jdk into 
vectorized-stringlatin1-hashcode
 - Some small refactoring: store power_of_31_backwards in the code directly, 
compact code, and more
 - {wip} Generalize string hashcode to Arrays.hashCode
 - ... and 8 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...29dab16b

-------------

Changes: https://git.openjdk.java.net/jdk/pull/7700/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7700&range=11
  Stats: 1151 lines in 25 files changed: 1140 ins; 0 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7700.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7700/head:pull/7700

PR: https://git.openjdk.java.net/jdk/pull/7700

Re: RFR: 8282664: Unroll by hand StringUTF16, StringLatin1, and Arrays polynomial hash loops [v12]

Reply via email to