Hello again...

For what occurs in modern CPUs, well, no. Surprising to see that a mul
would be faster than a shr or shl. How comes?

No, not faster.  But definitely way faster than >30 cycles not so long ago.

I used to be ok with these things when I was writing demoscene code a
loong time ago but I'd need an extremely serious refresh.

You said "demoscene"?!... do you have a handle???

Andres.

As a side note, there is a huge uptake in the BigData/mapreduce/hadoop
environment where Smalltalk is sorely absent. Scala seems to fill the
void on the JVM.
There hashing is quite key, to remap all of the mapping phase results to
the reduce nodes. I am surprised to see that Smalltalk vendors haven't
jumped in that space.

Phil


On Wed, Feb 26, 2014 at 2:04 AM, Andres Valloud
<avall...@smalltalk.comcastbiz.net
<mailto:avall...@smalltalk.comcastbiz.net>> wrote:

    Hello...

    On 2/25/14 1:17 , p...@highoctane.be <mailto:p...@highoctane.be> wrote:

        I am currently reading through the Hashing in Smalltalk book
        
(http://www.lulu.com/shop/__andres-valloud/hashing-in-__smalltalk-theory-and-practice/__paperback/product-3788892.html
        
<http://www.lulu.com/shop/andres-valloud/hashing-in-smalltalk-theory-and-practice/paperback/product-3788892.html>__)
        and, my head hurting notwithstanding, there are indeed a ton of
        gems in
        this system. As he mentions, doing the exercises brings a lot of
        extra :-)


    :) thank you.

        When going to 64-bit, and with the new ObjectMemory scheme, I
        guess a
        couple of identity hashing functions will come under scrutiny.

        e.g.

        SmallInteger>>hashMultiply
        | low |

        low := self bitAnd: 16383.
        ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
        bitAnd: 16383) * 16384))
        bitAnd: 16r0FFFFFFF


        which will need some more bits.


    IMO it's not clear that SmallInteger>>identityHash should be
    implemented that way.  Finding a permutation of the small integers
    that also behaves like a good quality hash function and evaluates
    quickly (in significantly less time and complexity than, say, Bob
    Jenkins' lookup3) would be a really interesting research project.  I
    don't know if it's possible.  If no such thing exists, then getting
    at least some rough idea of what's the minimum necessary complexity
    for such hash functions would be valuable.

    Looking at hashMultiply as a non-identity hash function, one would
    start having problems when significantly more than 2^28 objects are
    stored in a single hashed collection.  2^28 objects with e.g. 12
    bytes per header and a minimum of one instance variable (so the hash
    value isn't a instance-constant) stored in a hashed collection
    requires more than 4gb, so that is clearly a 64 bit image problem.
      In 64 bits, 2^28 objects with e.g. 16 bytes per header and a
    minimum of one instance variable each is already 6gb, and a minimum
    of 8gb with the hashed collection itself.

    Because of these figures, I'd think improving the implementation of
    hashed collections takes priority over adding more non-identity hash
    function bits (as long as the existing hash values are of good quality).

    Did you look at the Hash Analysis Tool I wrote?  It's in the Cincom
    public Store repository.  It comes in two bundles: Hash Analysis
    Tool, and Hash Analysis Tool - Extensions.  With everything loaded,
    the tool comes with 300+ hash functions and 100+ data sets out of
    the box.  The code is MIT.

        I had a look at how it was done in VisualWorks;


    The implementation of hashMultiply, yes.  Note however that
    SmallInteger>>hash is ^self.

        hashMultiply
        "Multiply the receiver by 16r0019660D mod 2^28
        without using large integer arithmetic for speed.
        The constant is a generator of the multiplicative
        subgroup of Z_2^30, see Knuth's TAOCP vol 2."
        <primitive: 1747>
        | low14Bits |
        low14Bits := self bitAnd: 16r3FFF.
        ^16384
        * (16r260D * (self bitShift: -14) + (16r0065 * low14Bits)
        bitAnd: 16r3FFF)
        + (16r260D * low14Bits) bitAnd: 16rFFFFFFF

        The hashing book version has:

        multiplication
        "Computes self times 1664525 mod 2^38 while avoiding overflow into a
        large integer by making the multiplication into two 14 bits
        chunks. Do
        not use any division or modulo operation."
        | lowBits highBits|

        lowBits := self bitAnd: 16r3FFF.
        highBits := self bitShift: -14.
        ^(lowBits * 16r260D)
            + (((lowBits * 16r0065) bitAnd: 16r3FFF) bitShift: 14)
            + (((highBits * 16r260D) bitAnd: 16r3FFF) bitShift: 14)
        bitAnd: 16rFFFFFFF

        So, 16384 * is the same as bitShift: 14 and it looks like done once,
        which may be better.


    It should be a primitive (or otherwise optimized somehow).  At some
    point though that hash function was implemented for e.g. ByteArray
    in Squeak, I thought at that point the multiplication step was also
    implemented as a primitive?

        Also VW marks it as a primitive, which Pharo does not.


    In VW it is also a translated primitive, i.e. it's executed directly
    in the JIT without calling C.

    Keep in mind that the speed at which hash values are calculated is
    only part of the story.  If the hash function quality is not great,
    or the hashed collection implementation is not efficient and induces
    collisions or other extra work, improving the efficiency of the hash
    functions won't do much.  I think it's mentioned in the hash book
    (I'd have to check), but once I made a hash function 5x times slower
    to get better quality and the result was that a report that was
    taking 30 minutes took 90 seconds instead (and hashing was gone from
    the profiler output).

        Would we gain
        some speed doing that? hashMultiply is used a lof for identity
        hashes.

        Bytecode has quite some work to do:

        37 <70> self
        38 <20> pushConstant: 16383
        39 <BE> send: bitAnd:
        40 <68> popIntoTemp: 0
        41 <21> pushConstant: 9741
        42 <10> pushTemp: 0
        43 <B8> send: *
        44 <21> pushConstant: 9741
        45 <70> self
        46 <22> pushConstant: -14
        47 <BC> send: bitShift:
        48 <B8> send: *
        49 <23> pushConstant: 101
        50 <10> pushTemp: 0
        51 <B8> send: *
        52 <B0> send: +
        53 <20> pushConstant: 16383
        54 <BE> send: bitAnd:
        55 <24> pushConstant: 16384
        56 <B8> send: *
        57 <B0> send: +
        58 <25> pushConstant: 268435455
        59 <BE> send: bitAnd:
        60 <7C> returnTop


    If this is a primitive instead, then you can also avoid the overflow
    into large integers and do the math with (basically)

    mov eax, smallInteger
    shr eax, numberOfTagBits
    mul eax, 1664525 "the multiplication that throws out the high bits"
    shl eax, 4 "throw out bits 29-32"
    shr eax, 4
    lea eax, [eax * 2^numberOfTagBits + smallIntegerTagBits]

    Please excuse trivial omissions in the above, it's written only for
    the sake of illustration (e.g. it looks like the 3 last instructions
    can be combined into two... lea followed by shr).  Also, did you see
    the latency of integer multiplication instructions in modern x86
    processors?...

        I ran some experiments timing things.

        It looks like that replacing 16384 * by bitShift:14 leads to a small
        gain, bitShift (primitive 17) being faster than * (primitive 9)


    Keep in mind those operations still have to check for overflow into
    large integers.  In this case, large integers are not necessary.

    Andres.

        The bytecode is identical, except send: bitShift instead of send: *

        multiplication3
        | low |

        low := self bitAnd: 16383.
        ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
        bitAnd: 16383) bitShift: 14))
        bitAnd: 16r0FFFFFFF



        [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun 12
        [500000 timesRepeat: [ 15000 multiplication ]] timeToRun 41  (worse)
        [500000 timesRepeat: [ 15000 multiplication3 ]] timeToRun 10
        (better)

        It looks like correct for SmallInteger minVal to: SmallInteger
        maxVal

        Now, VW gives: [500000 timesRepeat: [ 15000 hashMultiply ]]
        timeToRun
        1.149 milliseconds

        Definitely worth investigating the primitive thing, or some NB
        Asm as
        this is used about everywhere (Collections etc).

        Toughts?

        Phil




Reply via email to