I got it working and kicked the tires a bit.

It would be useful to have such a tool in Pharo :-)

Now, time to dig.

Phil


On Mon, Mar 3, 2014 at 12:13 AM, Andres Valloud <
[email protected]> wrote:

> Once the code is loaded, from the Tools menu use Hash Analysis Tool.
> There's a manual below, and also the Fundamentals book has a somewhat
> in-depth discussion of how it works.
>
> ftp://sqrmax.us.to/pub/Smalltalk/Papers/Hash%20Analysis%20Tool.pdf
>
> On 3/2/14 15:06 , [email protected] wrote:
>
>> I have found the tools in the Cincom public repo.
>>
>> Now, I have to see how this works in VW, I am not that proficient with it.
>>
>> Phil
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:55 AM, [email protected] wrote:
>>
>>     Andres,
>>
>>     Thanks for the insights.
>>
>>     Hash quality is indeed a key factor. At least, my mind is
>>     grasping this hashing field somewhat better now.
>>
>>     I'll have a look at the tools, I haven't used them yet.
>>
>>     And I'll take a shot at the ASM version with NativeBoost in Pharo.
>>
>>     As for what happens in modern CPUs, well, no, I hadn't. It is
>>     surprising that a mul could be faster than a shr or a shl. How come?
>>     I used to be comfortable with these things when I was writing
>>     demoscene code a long time ago, but I'd need a very serious refresher.
>>
>>     As a side note, there is a huge uptake in the
>>     Big Data/MapReduce/Hadoop space, where Smalltalk is sorely
>>     absent. Scala seems to fill the void on the JVM.
>>     Hashing is quite key there, for remapping all of the map-phase
>>     results onto the reduce nodes. I am surprised that Smalltalk
>>     vendors haven't jumped into that space.
>>
>>     Phil
>>
>>
>>     On Wed, Feb 26, 2014 at 2:04 AM, Andres Valloud
>>     <[email protected]> wrote:
>>
>>         Hello...
>>
>>         On 2/25/14 1:17 , [email protected] wrote:
>>
>>             I am currently reading through the Hashing in Smalltalk book
>>             (http://www.lulu.com/shop/andres-valloud/hashing-in-smalltalk-theory-and-practice/paperback/product-3788892.html)
>>             and, my head hurting notwithstanding, there are indeed a ton
>>             of gems in this book. As he mentions, doing the exercises
>>             brings a lot of extra insight :-)
>>
>>
>>         :) thank you.
>>
>>             When going to 64-bit, and with the new ObjectMemory scheme,
>>             I guess a
>>             couple of identity hashing functions will come under scrutiny.
>>
>>             e.g.
>>
>>             SmallInteger>>hashMultiply
>>                 | low |
>>                 low := self bitAnd: 16383.
>>                 ^(16r260D * low + ((16r260D * (self bitShift: -14)
>>                     + (16r0065 * low) bitAnd: 16383) * 16384))
>>                     bitAnd: 16r0FFFFFFF
>>
>>
>>             which will need some more bits.
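As a cross-check (an editorial sketch, not from the thread): the 14-bit split in hashMultiply is equivalent to multiplying by 16r19660D = 1664525 and keeping the low 28 bits, because the constant decomposes as 1664525 = 101 * 16384 + 9741, i.e. 16r0065 and 16r260D. A quick Python transcription:

```python
# hashMultiply transcribed to Python (valid for non-negative inputs).
# 16r260D = 9741 is the low 14 bits of the constant 16r19660D = 1664525,
# and 16r0065 = 101 is the high part: 1664525 = 101 * 16384 + 9741.
def hash_multiply(x):
    low = x & 16383
    return (9741 * low
            + ((9741 * (x >> 14) + 101 * low) & 16383) * 16384) & 0xFFFFFFF

# Each term stays comfortably within SmallInteger range, yet the result
# agrees with the straightforward multiplication modulo 2^28:
for x in (0, 1, 15000, 16384, 2**30 - 1):
    assert hash_multiply(x) == (x * 1664525) & 0xFFFFFFF
```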
>>
>>
>>         IMO it's not clear that SmallInteger>>identityHash should be
>>         implemented that way.  Finding a permutation of the small
>>         integers that also behaves like a good quality hash function and
>>         evaluates quickly (in significantly less time and complexity
>>         than, say, Bob Jenkins' lookup3) would be a really interesting
>>         research project.  I don't know if it's possible.  If no such
>>         thing exists, then getting at least some rough idea of what's
>>         the minimum necessary complexity for such hash functions would
>>         be valuable.
>>
>>         Looking at hashMultiply as a non-identity hash function, one
>>         would start having problems when significantly more than 2^28
>>         objects are stored in a single hashed collection.  2^28 objects
>>         with e.g. 12 bytes per header and a minimum of one instance
>>         variable (so the hash value isn't an instance-constant) stored in
>>         a hashed collection require more than 4 GB, so that is clearly a
>>         64-bit image problem.  In 64 bits, 2^28 objects with e.g. 16
>>         bytes per header and a minimum of one instance variable each are
>>         already 6 GB, and a minimum of 8 GB with the hashed collection
>>         itself.
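Those back-of-envelope figures check out; a quick Python sanity check using the header and slot sizes assumed above (pointer-sized instance variables are an assumption of this sketch):

```python
n = 2 ** 28          # objects stored in one hashed collection
GiB = 2 ** 30

# 32-bit: 12-byte header + one 4-byte instance variable per object
assert n * (12 + 4) == 4 * GiB
# 64-bit: 16-byte header + one 8-byte instance variable per object
assert n * (16 + 8) == 6 * GiB
# plus the collection's own pointer array: at least n slots of 8 bytes
assert n * (16 + 8) + n * 8 == 8 * GiB
```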
>>
>>         Because of these figures, I'd think improving the implementation
>>         of hashed collections takes priority over adding more
>>         non-identity hash function bits (as long as the existing hash
>>         values are of good quality).
>>
>>         Did you look at the Hash Analysis Tool I wrote?  It's in the
>>         Cincom public Store repository.  It comes in two bundles: Hash
>>         Analysis Tool, and Hash Analysis Tool - Extensions.  With
>>         everything loaded, the tool comes with 300+ hash functions and
>>         100+ data sets out of the box.  The code is MIT.
>>
>>             I had a look at how it was done in VisualWorks;
>>
>>
>>         The implementation of hashMultiply, yes.  Note however that
>>         SmallInteger>>hash is ^self.
>>
>>             hashMultiply
>>             "Multiply the receiver by 16r0019660D mod 2^28
>>             without using large integer arithmetic for speed.
>>             The constant is a generator of the multiplicative
>>             subgroup of Z_2^30, see Knuth's TAOCP vol 2."
>>             <primitive: 1747>
>>             | low14Bits |
>>             low14Bits := self bitAnd: 16r3FFF.
>>             ^16384
>>             * (16r260D * (self bitShift: -14) + (16r0065 * low14Bits)
>>             bitAnd: 16r3FFF)
>>             + (16r260D * low14Bits) bitAnd: 16rFFFFFFF
>>
>>             The hashing book version has:
>>
>>             multiplication
>>             "Computes self times 1664525 mod 2^28 while avoiding
>>             overflow into a large integer by splitting the multiplication
>>             into two 14-bit chunks. Do not use any division or modulo
>>             operation."
>>             | lowBits highBits|
>>
>>             lowBits := self bitAnd: 16r3FFF.
>>             highBits := self bitShift: -14.
>>             ^(lowBits * 16r260D)
>>                 + (((lowBits * 16r0065) bitAnd: 16r3FFF) bitShift: 14)
>>                 + (((highBits * 16r260D) bitAnd: 16r3FFF) bitShift: 14)
>>             bitAnd: 16rFFFFFFF
>>
>>             So, 16384 * is the same as bitShift: 14, and in this version
>>             it looks like it is done only once, which may be better.
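Transcribing both variants into Python (for non-negative receivers; bitShift: -14 becomes a right shift) shows they compute the same value, so any difference is only in how many multiplications and shifts get executed:

```python
# VW-style: one explicit * 16384 applied to the masked high sum.
def vw_style(x):
    low14 = x & 0x3FFF
    return (16384 * ((9741 * (x >> 14) + 101 * low14) & 0x3FFF)
            + 9741 * low14) & 0xFFFFFFF

# Book-style: each partial product masked and shifted separately.
def book_style(x):
    low, high = x & 0x3FFF, x >> 14
    return ((low * 9741)
            + (((low * 101) & 0x3FFF) << 14)
            + (((high * 9741) & 0x3FFF) << 14)) & 0xFFFFFFF

# The final bitAnd: 16rFFFFFFF makes the intermediate masking choices
# irrelevant: both variants agree everywhere.
for x in (0, 1, 16383, 16384, 15000, 2**30 - 1):
    assert vw_style(x) == book_style(x)
```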
>>
>>
>>         It should be a primitive (or otherwise optimized somehow).  At
>>         some point that hash function was implemented for e.g.
>>         ByteArray in Squeak; I thought the multiplication step was
>>         also implemented as a primitive at that point?
>>
>>             Also VW marks it as a primitive, which Pharo does not.
>>
>>
>>         In VW it is also a translated primitive, i.e. it's executed
>>         directly in the JIT without calling C.
>>
>>         Keep in mind that the speed at which hash values are calculated
>>         is only part of the story.  If the hash function quality is not
>>         great, or the hashed collection implementation is not efficient
>>         and induces collisions or other extra work, improving the
>>         efficiency of the hash functions won't do much.  I think it's
>>         mentioned in the hash book (I'd have to check), but once I made
>>         a hash function 5x slower to get better quality, and the
>>         result was that a report that had been taking 30 minutes took 90
>>         seconds instead (and hashing was gone from the profiler output).
>>
>>             Would we gain
>>             some speed doing that? hashMultiply is used a lot for
>>             identity hashes.
>>
>>             Bytecode has quite some work to do:
>>
>>             37 <70> self
>>             38 <20> pushConstant: 16383
>>             39 <BE> send: bitAnd:
>>             40 <68> popIntoTemp: 0
>>             41 <21> pushConstant: 9741
>>             42 <10> pushTemp: 0
>>             43 <B8> send: *
>>             44 <21> pushConstant: 9741
>>             45 <70> self
>>             46 <22> pushConstant: -14
>>             47 <BC> send: bitShift:
>>             48 <B8> send: *
>>             49 <23> pushConstant: 101
>>             50 <10> pushTemp: 0
>>             51 <B8> send: *
>>             52 <B0> send: +
>>             53 <20> pushConstant: 16383
>>             54 <BE> send: bitAnd:
>>             55 <24> pushConstant: 16384
>>             56 <B8> send: *
>>             57 <B0> send: +
>>             58 <25> pushConstant: 268435455
>>             59 <BE> send: bitAnd:
>>             60 <7C> returnTop
>>
>>
>>         If this is a primitive instead, then you can also avoid the
>>         overflow into large integers and do the math with (basically)
>>
>>         mov eax, smallInteger
>>         shr eax, numberOfTagBits
>>         mul eax, 1664525 "the multiplication that throws out the high bits"
>>         shl eax, 4 "throw out bits 29-32"
>>         shr eax, 4
>>         lea eax, [eax * 2^numberOfTagBits + smallIntegerTagBits]
>>
>>         Please excuse trivial omissions in the above, it's written only
>>         for the sake of illustration (e.g. it looks like the last 3
>>         instructions can be combined into two... lea followed by shr).
>>         Also, did you see the latency of integer multiplication
>>         instructions in modern x86 processors?...
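The sketch can be modeled in Python to check the arithmetic; the tag layout below (one tag bit, tag value 1) is only an assumption for illustration, not necessarily what any particular VM uses:

```python
# Model of the untag -> multiply -> mask -> retag sequence above,
# assuming (hypothetically) 1 tag bit and tag value 1 for SmallIntegers.
TAG_BITS, TAG = 1, 1

def hash_primitive(oop):
    x = oop >> TAG_BITS                  # shr: strip the tag
    x = (x * 1664525) & 0xFFFFFFFF       # mul: 32-bit wraparound, no overflow check
    x = ((x << 4) & 0xFFFFFFFF) >> 4     # shl 4 then shr 4: drop bits 29-32
    return (x << TAG_BITS) | TAG         # lea: retag the result as a SmallInteger

# The tagged result matches hashMultiply's value, tagged the same way:
oop = (15000 << TAG_BITS) | TAG
assert hash_primitive(oop) == (((15000 * 1664525) & 0xFFFFFFF) << TAG_BITS) | TAG
```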
>>
>>             I ran some experiments timing things.
>>
>>             It looks like replacing 16384 * by bitShift: 14 leads to a
>>             small gain, bitShift: (primitive 17) being faster than *
>>             (primitive 9).
>>
>>
>>         Keep in mind those operations still have to check for overflow
>>         into large integers.  In this case, large integers are not
>>         necessary.
>>
>>         Andres.
>>
>>             The bytecode is identical, except send: bitShift: instead of
>>             send: *
>>
>>             multiplication3
>>                 | low |
>>                 low := self bitAnd: 16383.
>>                 ^(16r260D * low + ((16r260D * (self bitShift: -14)
>>                     + (16r0065 * low) bitAnd: 16383) bitShift: 14))
>>                     bitAnd: 16r0FFFFFFF
>>
>>
>>
>>             [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun      "12"
>>             [500000 timesRepeat: [ 15000 multiplication ]] timeToRun    "41 (worse)"
>>             [500000 timesRepeat: [ 15000 multiplication3 ]] timeToRun   "10 (better)"
>>
>>             It looks correct over SmallInteger minVal to:
>>             SmallInteger maxVal.
>>
>>             Now, VW gives: [500000 timesRepeat: [ 15000 hashMultiply ]]
>>             timeToRun
>>             1.149 milliseconds
>>
>>             Definitely worth investigating the primitive approach, or
>>             some NativeBoost assembly, as this is used just about
>>             everywhere (Collections, etc.).
>>
>>             Thoughts?
>>
>>             Phil
>>
>>
>>
>>
>>
>
