Re: [Pharo-dev] hashMultiply

[email protected] Sun, 02 Mar 2014 15:09:17 -0800

I have found the tools in the Cincom public repo.

Now, I have to see how this works in VW, I am not that proficient with it.


Phil



On Wed, Feb 26, 2014 at 8:55 AM, [email protected] <[email protected]>wrote:

> Andres,
>
> Thanks for the insights.
>
> hash quality is indeed an key factor. At least, my mind is somewhat
> grasping this hashing field a bit better.
>
> I'll have a look at the tools, I haven't used them yet.
>
> And a shot at the ASM version with NativeBoost in Pharo.
>
> For what occurs in modern CPUs, well, no. Surprising to see that a mul
> would be faster than a shr or shl. How comes?
> I used to be ok with these things when I was writing demoscene code a
> loong time ago but I'd need an extremely serious refresh.
>
> As a side note, there is a huge uptake in the BigData/mapreduce/hadoop
> environment where Smalltalk is sorely absent. Scala seems to fill the void
> on the JVM.
> There hashing is quite key, to remap all of the mapping phase results to
> the reduce nodes. I am surprised to see that Smalltalk vendors haven't
> jumped in that space.
>
> Phil
>
>
> On Wed, Feb 26, 2014 at 2:04 AM, Andres Valloud <
> [email protected]> wrote:
>
>> Hello...
>>
>> On 2/25/14 1:17 , [email protected] wrote:
>>
>>> I am currently reading through the Hashing in Smalltalk book
>>> (http://www.lulu.com/shop/andres-valloud/hashing-in-
>>> smalltalk-theory-and-practice/paperback/product-3788892.html)
>>> and, my head hurting notwithstanding, there are indeed a ton of gems in
>>> this system. As he mentions, doing the exercises brings a lot of extra
>>> :-)
>>>
>>
>> :) thank you.
>>
>>  When going to 64-bit, and with the new ObjectMemory scheme, I guess a
>>> couple of identity hashing functions will come under scrutiny.
>>>
>>> e.g.
>>>
>>> SmallInteger>>hashMultiply
>>> | low |
>>>
>>> low := self bitAnd: 16383.
>>> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
>>> bitAnd: 16383) * 16384))
>>> bitAnd: 16r0FFFFFFF
>>>
>>>
>>> which will need some more bits.
>>>
>>
>> IMO it's not clear that SmallInteger>>identityHash should be implemented
>> that way.  Finding a permutation of the small integers that also behaves
>> like a good quality hash function and evaluates quickly (in significantly
>> less time and complexity than, say, Bob Jenkins' lookup3) would be a really
>> interesting research project.  I don't know if it's possible.  If no such
>> thing exists, then getting at least some rough idea of what's the minimum
>> necessary complexity for such hash functions would be valuable.
>>
>> Looking at hashMultiply as a non-identity hash function, one would start
>> having problems when significantly more than 2^28 objects are stored in a
>> single hashed collection.  2^28 objects with e.g. 12 bytes per header and a
>> minimum of one instance variable (so the hash value isn't a
>> instance-constant) stored in a hashed collection requires more than 4gb, so
>> that is clearly a 64 bit image problem.  In 64 bits, 2^28 objects with e.g.
>> 16 bytes per header and a minimum of one instance variable each is already
>> 6gb, and a minimum of 8gb with the hashed collection itself.
>>
>> Because of these figures, I'd think improving the implementation of
>> hashed collections takes priority over adding more non-identity hash
>> function bits (as long as the existing hash values are of good quality).
>>
>> Did you look at the Hash Analysis Tool I wrote?  It's in the Cincom
>> public Store repository.  It comes in two bundles: Hash Analysis Tool, and
>> Hash Analysis Tool - Extensions.  With everything loaded, the tool comes
>> with 300+ hash functions and 100+ data sets out of the box.  The code is
>> MIT.
>>
>>  I had a look at how it was done in VisualWorks;
>>>
>>
>> The implementation of hashMultiply, yes.  Note however that
>> SmallInteger>>hash is ^self.
>>
>>  hashMultiply
>>> "Multiply the receiver by 16r0019660D mod 2^28
>>> without using large integer arithmetic for speed.
>>> The constant is a generator of the multiplicative
>>> subgroup of Z_2^30, see Knuth's TAOCP vol 2."
>>> <primitive: 1747>
>>> | low14Bits |
>>> low14Bits := self bitAnd: 16r3FFF.
>>> ^16384
>>> * (16r260D * (self bitShift: -14) + (16r0065 * low14Bits) bitAnd:
>>> 16r3FFF)
>>> + (16r260D * low14Bits) bitAnd: 16rFFFFFFF
>>>
>>> The hashing book version has:
>>>
>>> multiplication
>>> "Computes self times 1664525 mod 2^38 while avoiding overflow into a
>>> large integer by making the multiplication into two 14 bits chunks. Do
>>> not use any division or modulo operation."
>>> | lowBits highBits|
>>>
>>> lowBits := self bitAnd: 16r3FFF.
>>> highBits := self bitShift: -14.
>>> ^(lowBits * 16r260D)
>>>    + (((lowBits * 16r0065) bitAnd: 16r3FFF) bitShift: 14)
>>>    + (((highBits * 16r260D) bitAnd: 16r3FFF) bitShift: 14)
>>> bitAnd: 16rFFFFFFF
>>>
>>> So, 16384 * is the same as bitShift: 14 and it looks like done once,
>>> which may be better.
>>>
>>
>> It should be a primitive (or otherwise optimized somehow).  At some point
>> though that hash function was implemented for e.g. ByteArray in Squeak, I
>> thought at that point the multiplication step was also implemented as a
>> primitive?
>>
>>  Also VW marks it as a primitive, which Pharo does not.
>>>
>>
>> In VW it is also a translated primitive, i.e. it's executed directly in
>> the JIT without calling C.
>>
>> Keep in mind that the speed at which hash values are calculated is only
>> part of the story.  If the hash function quality is not great, or the
>> hashed collection implementation is not efficient and induces collisions or
>> other extra work, improving the efficiency of the hash functions won't do
>> much.  I think it's mentioned in the hash book (I'd have to check), but
>> once I made a hash function 5x times slower to get better quality and the
>> result was that a report that was taking 30 minutes took 90 seconds instead
>> (and hashing was gone from the profiler output).
>>
>>  Would we gain
>>> some speed doing that? hashMultiply is used a lof for identity hashes.
>>>
>>> Bytecode has quite some work to do:
>>>
>>> 37 <70> self
>>> 38 <20> pushConstant: 16383
>>> 39 <BE> send: bitAnd:
>>> 40 <68> popIntoTemp: 0
>>> 41 <21> pushConstant: 9741
>>> 42 <10> pushTemp: 0
>>> 43 <B8> send: *
>>> 44 <21> pushConstant: 9741
>>> 45 <70> self
>>> 46 <22> pushConstant: -14
>>> 47 <BC> send: bitShift:
>>> 48 <B8> send: *
>>> 49 <23> pushConstant: 101
>>> 50 <10> pushTemp: 0
>>> 51 <B8> send: *
>>> 52 <B0> send: +
>>> 53 <20> pushConstant: 16383
>>> 54 <BE> send: bitAnd:
>>> 55 <24> pushConstant: 16384
>>> 56 <B8> send: *
>>> 57 <B0> send: +
>>> 58 <25> pushConstant: 268435455
>>> 59 <BE> send: bitAnd:
>>> 60 <7C> returnTop
>>>
>>
>> If this is a primitive instead, then you can also avoid the overflow into
>> large integers and do the math with (basically)
>>
>> mov eax, smallInteger
>> shr eax, numberOfTagBits
>> mul eax, 1664525 "the multiplication that throws out the high bits"
>> shl eax, 4 "throw out bits 29-32"
>> shr eax, 4
>> lea eax, [eax * 2^numberOfTagBits + smallIntegerTagBits]
>>
>> Please excuse trivial omissions in the above, it's written only for the
>> sake of illustration (e.g. it looks like the 3 last instructions can be
>> combined into two... lea followed by shr).  Also, did you see the latency
>> of integer multiplication instructions in modern x86 processors?...
>>
>>  I ran some experiments timing things.
>>>
>>> It looks like that replacing 16384 * by bitShift:14 leads to a small
>>> gain, bitShift (primitive 17) being faster than * (primitive 9)
>>>
>>
>> Keep in mind those operations still have to check for overflow into large
>> integers.  In this case, large integers are not necessary.
>>
>> Andres.
>>
>>  The bytecode is identical, except send: bitShift instead of send: *
>>>
>>> multiplication3
>>> | low |
>>>
>>> low := self bitAnd: 16383.
>>> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low)
>>> bitAnd: 16383) bitShift: 14))
>>> bitAnd: 16r0FFFFFFF
>>>
>>>
>>>
>>> [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun 12
>>> [500000 timesRepeat: [ 15000 multiplication ]] timeToRun 41  (worse)
>>> [500000 timesRepeat: [ 15000 multiplication3 ]] timeToRun 10 (better)
>>>
>>> It looks like correct for SmallInteger minVal to: SmallInteger maxVal
>>>
>>> Now, VW gives: [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun
>>> 1.149 milliseconds
>>>
>>> Definitely worth investigating the primitive thing, or some NB Asm as
>>> this is used about everywhere (Collections etc).
>>>
>>> Toughts?
>>>
>>> Phil
>>>
>>
>>
>

Re: [Pharo-dev] hashMultiply

Reply via email to