On Fri, 4 Mar 2022 17:44:44 GMT, Ludovic Henry <luhe...@openjdk.org> wrote:

>> Despite the hash value being cached for Strings, computing the hash still 
>> represents significant CPU usage for applications handling lots of text.
>> 
>> Even though it would generally be better to do this through an enhancement to 
>> the autovectorizer, the complexity of doing it by hand is trivial and the gain 
>> is sizable (2x speedup) even without the Vector API. The algorithm was proposed 
>> by Richard Startin and Paul Sandoz [1].
>> 
>> Speedups are as follows on an `Intel(R) Xeon(R) E-2276G CPU @ 3.80GHz`:
>> 
>> 
>> Benchmark                                  (size)  Mode  Cnt     Score   Error  Units
>> StringHashCode.Algorithm.scalar                 0  avgt          1.799          ns/op
>> StringHashCode.Algorithm.scalar                 1  avgt          3.233          ns/op
>> StringHashCode.Algorithm.scalar                10  avgt          6.617          ns/op
>> StringHashCode.Algorithm.scalar               100  avgt         61.481          ns/op
>> StringHashCode.Algorithm.scalar              1000  avgt        631.690          ns/op
>> StringHashCode.Algorithm.scalar             10000  avgt       6293.611          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8        0  avgt          1.890          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8        1  avgt          3.494          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8       10  avgt          9.050          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8      100  avgt         31.725          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8     1000  avgt        321.031          ns/op
>> StringHashCode.Algorithm.scalarUnrolled8    10000  avgt       3203.838          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16       0  avgt          1.953          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16       1  avgt          3.485          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16      10  avgt          7.041          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16     100  avgt         30.975          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16    1000  avgt        316.616          ns/op
>> StringHashCode.Algorithm.scalarUnrolled16   10000  avgt       3208.658          ns/op
>> 
>> 
>> At Datadog, we handle a great amount of text (through logs management for 
>> example), and hashing Strings represents a large part of our CPU usage. It's 
>> very unlikely that we are the only ones, as String.hashCode is such a core 
>> feature of JVM-based languages, with its use in HashMap for example. Even a 
>> 2x speedup would allow us to save thousands of CPU cores per month and 
>> correspondingly improve the energy/carbon impact.
>> 
>> [1] 
>> https://static.rainfocus.com/oracle/oow18/sess/1525822677955001tLqU/PF/codeone18-vector-API-DEV5081_1540354883936001Q3Sv.pdf
>
> Ludovic Henry has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add UTF-16 benchmarks

Could you print out the generated code? My wild guess is that the updated version 
is still scalar and that the improvement comes from breaking the dependency chain. 
I suggest splitting the accumulation across several independent accumulators that 
are combined outside the main loop, to achieve maximal scalar throughput. A small 
experiment on my machine shows an improvement over your approach.

    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    int i = 0, len = value.length;
    // Four independent accumulators: each one advances four positions per
    // iteration, so the per-step multiplier is 31^4 = 923521.
    for (; i < (len & ~3); i += 4) {
        c3 = c3 * 923521 + (value[i + 3] & 0xff);
        c2 = c2 * 923521 + (value[i + 2] & 0xff);
        c1 = c1 * 923521 + (value[i + 1] & 0xff);
        c0 = c0 * 923521 + (value[i] & 0xff);
    }
    // Recombine with 31^3, 31^2, 31^1 and 31^0 so the result matches the
    // sequential h = h * 31 + b hash.
    int h = c0 * 29791 + c1 * 961 + c2 * 31 + c3;
    // Handle the remaining 0-3 elements.
    for (; i < len; i++) {
        h = h * 31 + (value[i] & 0xff);
    }
    return h;
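
For reference, a quick way to sanity-check that such a multi-accumulator variant 
stays equivalent to the plain `h = h * 31 + b` loop is to compare both over random 
byte arrays. This is only a minimal sketch; the `HashCheck`, `scalarHash` and 
`unrolledHash` names are hypothetical and not part of the patch.

    import java.util.Random;

    public class HashCheck {
        // Baseline: the sequential 31-based polynomial hash over unsigned bytes.
        static int scalarHash(byte[] value) {
            int h = 0;
            for (byte b : value) {
                h = h * 31 + (b & 0xff);
            }
            return h;
        }

        // Unrolled variant with four independent accumulators (same shape as above).
        static int unrolledHash(byte[] value) {
            int c0 = 0, c1 = 0, c2 = 0, c3 = 0;
            int i = 0, len = value.length;
            for (; i < (len & ~3); i += 4) {
                c3 = c3 * 923521 + (value[i + 3] & 0xff);
                c2 = c2 * 923521 + (value[i + 2] & 0xff);
                c1 = c1 * 923521 + (value[i + 1] & 0xff);
                c0 = c0 * 923521 + (value[i] & 0xff);
            }
            int h = c0 * 29791 + c1 * 961 + c2 * 31 + c3;
            for (; i < len; i++) {
                h = h * 31 + (value[i] & 0xff);
            }
            return h;
        }

        public static void main(String[] args) {
            Random random = new Random(42);
            for (int run = 0; run < 10_000; run++) {
                byte[] value = new byte[random.nextInt(64)];
                random.nextBytes(value);
                if (scalarHash(value) != unrolledHash(value)) {
                    throw new AssertionError("mismatch at length " + value.length);
                }
            }
            System.out.println("all hashes match");
        }
    }

The recombination constants 29791, 961 and 31 are 31^3, 31^2 and 31^1, and the 
in-loop multiplier 923521 is 31^4, which is what keeps the two versions equal.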

Thanks a lot.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7700
