On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhe...@openjdk.org> wrote:
>> Despite the hash value being cached for Strings, computing the hash still >> represents a significant CPU usage for applications handling lots of text. >> >> Even though it would be generally better to do it through an enhancement to >> the autovectorizer, the complexity of doing it by hand is trivial and the >> gain is sizable (2x speedup) even without the Vector API. The algorithm has >> been proposed by Richard Startin and Paul Sandoz [1]. >> >> Speedup are as follows on a `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz` >> >> >> Benchmark (size) Mode Cnt Score >> Error Units >> StringHashCode.Algorithm.scalarLatin1 0 avgt 25 2.111 >> ± 0.210 ns/op >> StringHashCode.Algorithm.scalarLatin1 1 avgt 25 3.500 >> ± 0.127 ns/op >> StringHashCode.Algorithm.scalarLatin1 10 avgt 25 7.001 >> ± 0.099 ns/op >> StringHashCode.Algorithm.scalarLatin1 100 avgt 25 61.285 >> ± 0.444 ns/op >> StringHashCode.Algorithm.scalarLatin1 1000 avgt 25 628.995 >> ± 0.846 ns/op >> StringHashCode.Algorithm.scalarLatin1 10000 avgt 25 6307.990 >> ± 4.071 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 0 avgt 25 2.358 >> ± 0.092 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 1 avgt 25 3.631 >> ± 0.159 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 10 avgt 25 7.049 >> ± 0.019 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 100 avgt 25 33.626 >> ± 1.218 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 1000 avgt 25 317.811 >> ± 1.225 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled16 10000 avgt 25 3212.333 >> ± 14.621 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 0 avgt 25 2.356 >> ± 0.097 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 1 avgt 25 3.630 >> ± 0.158 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 10 avgt 25 8.724 >> ± 0.065 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 100 avgt 25 32.402 >> ± 0.019 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 1000 avgt 25 321.949 >> ± 0.251 ns/op >> StringHashCode.Algorithm.scalarLatin1Unrolled8 10000 avgt 25 3202.083 >> ± 1.667 ns/op >> StringHashCode.Algorithm.scalarUTF16 0 avgt 25 2.135 >> ± 0.191 ns/op >> StringHashCode.Algorithm.scalarUTF16 1 avgt 25 5.202 >> ± 0.362 ns/op >> StringHashCode.Algorithm.scalarUTF16 10 avgt 25 11.105 >> ± 0.112 ns/op >> StringHashCode.Algorithm.scalarUTF16 100 avgt 25 75.974 >> ± 0.702 ns/op >> StringHashCode.Algorithm.scalarUTF16 1000 avgt 25 716.429 >> ± 3.290 ns/op >> StringHashCode.Algorithm.scalarUTF16 10000 avgt 25 7095.459 >> ± 43.847 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 0 avgt 25 2.381 >> ± 0.038 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 1 avgt 25 5.268 >> ± 0.422 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 10 avgt 25 11.248 >> ± 0.178 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 100 avgt 25 52.966 >> ± 0.089 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 1000 avgt 25 450.912 >> ± 1.834 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled16 10000 avgt 25 4403.988 >> ± 2.927 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 0 avgt 25 2.401 >> ± 0.032 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 1 avgt 25 5.091 >> ± 0.396 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 10 avgt 25 12.801 >> ± 0.189 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 100 avgt 25 52.068 >> ± 0.032 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 1000 avgt 25 453.270 >> ± 0.340 ns/op >> StringHashCode.Algorithm.scalarUTF16Unrolled8 10000 avgt 25 4433.112 >> ± 2.699 ns/op >> >> >> At Datadog, we handle a great amount of text (through logs management for >> example), and hashing String represents a large part of our CPU usage. It's >> very unlikely that we are the only one as String.hashCode is such a core >> feature of the JVM-based languages with its use in HashMap for example. >> Having even only a 2x speedup would allow us to save thousands of CPU cores >> per month and improve correspondingly the energy/carbon impact. >> >> [1] >> https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf > > Ludovic Henry has updated the pull request incrementally with one additional > commit since the last revision: > > Add UTF-16 benchmarks I think we should explore an intrinsic that also covers `Arrays.hashCode` too, as that will have the broader impact. It may be possible to express the intrinsic in say `ArraysSupport` as: @IntrinsicCandidate public long hashCode(Object o, int offset, int length, Class<?> type, boolean unsigned) { return 0; } then used as: public static int hashCode(byte[] a) { if (a == null) return 0; long v = ArraysSupport.hashCode(a, BYTE_OFFSET, a.length, byte.class, false); result = v & 0xFFFF_FFFF; int i = (int) (v >> 32) for (; i < a.length; i++) { result = 31 * result + a[i]; } return result; } Allowing for the intrinsic to return a "remainder" may simply its implementation (avoid having to do the tail). ------------- PR: https://git.openjdk.java.net/jdk/pull/7700