On 09/25/2014 06:00 PM, Vitaly Davidovich wrote:
Note that the compiler does not use multiplication here. It is strength-reduced into a bit shift followed by a subtraction (i.e. h * 31 = h * 32 (i.e. sal 5) - h).

I noticed that in the disassembly files provided by Aleksey. Good to know, so one need not bother with bit shifts in situations where multiplication / division look clearer in source code.
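For reference, the equivalence the compiler exploits can be sketched like this (a hypothetical standalone snippet, not taken from HotSpot itself):

```java
public class StrengthReduction {
    // h * 31 == h * 32 - h == (h << 5) - h, for all ints,
    // because Java int arithmetic wraps mod 2^32.
    static int mul31(int h)   { return h * 31; }
    static int shift31(int h) { return (h << 5) - h; }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 31, 123456789,
                         Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int h : samples) {
            if (mul31(h) != shift31(h))
                throw new AssertionError("mismatch at " + h);
        }
        System.out.println("h * 31 == (h << 5) - h for all samples");
    }
}
```

Since the JIT does this on its own, writing the shift form by hand in Java source buys nothing.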

The point about allowing execution units to run in parallel (given the break in data dependency) is very valid though.

For kicks and giggles, I looked at what gcc, clang, and icc generate for similar C code at O2 and O3, and did not observe them making any transformations to break the data dependency.

So HotSpot can break the ice here ;-)


While it would be cool if the compiler could do this transformation on its own, as you say Peter, the scope where this transform is applicable is somewhat narrow(?). If speeding up the non-cached String.hashCode() or Arrays.hashCode() methods is desired, perhaps intrinsics could be implemented (the utility for String.hashCode seems low, though). In that case, one could then do this transformation manually and also vectorize it for long inputs.
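To illustrate what such a manual transformation might look like (a sketch with assumed names, not the actual HashBench2 code): process several chars per chunk with precomputed powers of 31, so the multiplications within a chunk do not depend on each other:

```java
public class UnrolledHash {
    // Classic data-dependent loop, as in String.hashCode().
    static int hashCode0(char[] value) {
        int h = 0;
        for (char c : value) {
            h = 31 * h + c;
        }
        return h;
    }

    // Sketch of a hashCode3-style variant: 4 chars per chunk, so the
    // per-chunk multiplications are independent and can issue in
    // parallel. Integer wraparound makes the reassociation exact.
    static int hashCodeUnrolled(char[] value) {
        int h = 0;
        int i = 0;
        for (; i + 3 < value.length; i += 4) {
            h = h * (31 * 31 * 31 * 31)
              + value[i]     * (31 * 31 * 31)
              + value[i + 1] * (31 * 31)
              + value[i + 2] * 31
              + value[i + 3];
        }
        for (; i < value.length; i++) { // leftover tail
            h = 31 * h + value[i];
        }
        return h;
    }

    public static void main(String[] args) {
        char[] chars = "the quick brown fox jumps over the lazy dog".toCharArray();
        if (hashCode0(chars) != hashCodeUnrolled(chars)) throw new AssertionError();
        if (hashCode0(chars) != new String(chars).hashCode()) throw new AssertionError();
        System.out.println("all variants agree: " + hashCode0(chars));
    }
}
```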

I see there are multiple options - each with its own benefits and drawbacks. Hard to choose.

Regards, Peter


On Thu, Sep 25, 2014 at 7:09 AM, Peter Levart <peter.lev...@gmail.com> wrote:

    On 09/25/2014 09:40 AM, Aleksey Shipilev wrote:

        Hi Peter,

        On 09/25/2014 02:46 AM, Peter Levart wrote:

                    
http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench.java
                    

        Interesting.

        I have to say it once again:
          a) It is *an error* to use static finals as the benchmark
        input. They
        are perfectly constant-foldable in way too many cases. Break
        this habit,
        please.


    Hi Aleksey,

    The "constant" in this example is only a reference to the char[]
    array. Its content is not constant. In String, this is a final
    instance field, which behaves similarly inside an instance method
    (it is usually read only once per method invocation).

          b) Explicit Blackholes are not needed, and just returning
        "int" from
        @Benchmark method helps readability a lot. Please break this
        habit as
        well. Having readable and maintainable benchmarks is a key.


    Ok, here's a modified benchmark:

    http://cr.openjdk.java.net/~plevart/misc/StringHash/HashBench2.java 

    Which behaves similarly.

    Here are its results:

     * Results on JDK9, Linux, i7-2600K CPU, JMH args: -f 1 -wi 5 -i 8 -gc true
     *
     * Benchmark                     Mode  Samples         Score  Score error  Units
     * j.t.HashBench2._hashCode     thrpt        8   8308858.217   353019.084  ops/s
     * j.t.HashBench2.hashCode0     thrpt        8   8207337.729   217048.634  ops/s
     * j.t.HashBench2.hashCode1     thrpt        8  13359572.359   345736.675  ops/s
     * j.t.HashBench2.hashCode2     thrpt        8  15310621.202   238369.017  ops/s
     * j.t.HashBench2.hashCode3     thrpt        8  17637944.829   232155.847  ops/s
     * j.t.HashBench2.hashCode3i    thrpt        8  17724181.444   509913.288  ops/s
     * j.t.HashBench2.hashCode3x    thrpt        8   8344128.432   159508.813  ops/s
     * j.t.HashBench2.hashCode4     thrpt        8  16526850.489   969549.448  ops/s
     * j.t.HashBench2.hashCode5     thrpt        8  17567765.554   917934.885  ops/s
     * j.t.HashBench2.hashCode6     thrpt        8  17705074.332   419405.652  ops/s
     * j.t.HashBench2.hashCode7     thrpt        8  18805633.563   209181.299  ops/s
     * j.t.HashBench2.hashCode8     thrpt        8  18300123.201   376681.550  ops/s

    It would be interesting to see how it behaves on different CPUs.


                This is really great!

                Couldn't this be a tweak in HotSpot, instead of
                uglifying and bloating the Java and hence the byte
                code?

        +1

            This is for the HotSpot compiler guys to answer.
            Theoretically I think it is possible. But it would have to
            be tailored to a very specific use case and I don't know
            if such a specific transformation would have wide enough
            applicability. If it would only speed up String.hashCode
            and very similar loops, it is less trouble to do that by
            hand in one or a few places...

        I would think this happens in user-specified hashCode() over
        arrays.
        IDEs would routinely inject the loop like that or delegate to
        Arrays.hashCode, that does the same loop.


    Arrays.hashCode() can be "improved" this way too.


        In other words, I would like to see this fixed on compiler
        side. This
        seems to be the strength-reduction playing tricks with loop
        unrolling,
        I'll submit a bug shortly.


    As I said, I don't think it has anything to do with loop
    unrolling. The transformation I applied in hashCode1, 2, 3, ... 8
    produces code that executes 2, 3, 4, ... 9 independent
    multiplications in each chunk, which allows the CPU's pipeline to
    execute them in parallel. I had to manually unroll the loop a bit
    just to achieve this transformation, but the manual unrolling does
    not bring the speed-up per se; the parallelization of the
    multiplications does. This can be seen in the score of the
    hashCode3x benchmark, which has the same loop structure as
    hashCode3 but performs the multiplications in a way where each
    depends on the result of the previous one, which prevents the CPU
    from parallelizing them.

    This is not to say that such a transformation couldn't be done on
    the JIT side. I just have a feeling that it won't be widely used
    because it is very specific. It can only be applied within
    integer arithmetic of a homogeneous width (since it changes the
    order of operations, the final result depends on which width is
    in effect when overflow happens). Floating-point arithmetic is
    equally unsuitable for transformations that change the order of
    operations. And it only helps when a sequence of operations that
    depend on one another is changed into a sequence of independent
    operations whose cost matters (such as multiplication).
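A small standalone sketch of why the width constraint matters: int arithmetic wraps mod 2^32, so reassociating the hash computation is exact even when intermediate values overflow, while the same reassociation can change results with floating-point:

```java
public class ReassociationDemo {
    public static void main(String[] args) {
        // Integer case: wrapping arithmetic makes reassociation exact,
        // even though h * 31 overflows here.
        int h = Integer.MAX_VALUE, a = 12345, b = 6789;
        int sequential   = (h * 31 + a) * 31 + b;      // dependent chain
        int reassociated = h * (31 * 31) + a * 31 + b; // independent terms
        if (sequential != reassociated) throw new AssertionError();
        System.out.println("int reassociation is exact: " + sequential);

        // Floating-point case: reassociation changes the result.
        double left  = (0.1 + 0.2) + 0.3;
        double right = 0.1 + (0.2 + 0.3);
        System.out.println(left == right); // false on IEEE 754 doubles
    }
}
```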

    Regards, Peter


        -Aleksey.




