I have found the tools in the Cincom public repo. Now, I have to see how this works in VW, I am not that proficient with it.
Phil On Wed, Feb 26, 2014 at 8:55 AM, [email protected] <[email protected]>wrote: > Andres, > > Thanks for the insights. > > hash quality is indeed an key factor. At least, my mind is somewhat > grasping this hashing field a bit better. > > I'll have a look at the tools, I haven't used them yet. > > And a shot at the ASM version with NativeBoost in Pharo. > > For what occurs in modern CPUs, well, no. Surprising to see that a mul > would be faster than a shr or shl. How comes? > I used to be ok with these things when I was writing demoscene code a > loong time ago but I'd need an extremely serious refresh. > > As a side note, there is a huge uptake in the BigData/mapreduce/hadoop > environment where Smalltalk is sorely absent. Scala seems to fill the void > on the JVM. > There hashing is quite key, to remap all of the mapping phase results to > the reduce nodes. I am surprised to see that Smalltalk vendors haven't > jumped in that space. > > Phil > > > On Wed, Feb 26, 2014 at 2:04 AM, Andres Valloud < > [email protected]> wrote: > >> Hello... >> >> On 2/25/14 1:17 , [email protected] wrote: >> >>> I am currently reading through the Hashing in Smalltalk book >>> (http://www.lulu.com/shop/andres-valloud/hashing-in- >>> smalltalk-theory-and-practice/paperback/product-3788892.html) >>> and, my head hurting notwithstanding, there are indeed a ton of gems in >>> this system. As he mentions, doing the exercises brings a lot of extra >>> :-) >>> >> >> :) thank you. >> >> When going to 64-bit, and with the new ObjectMemory scheme, I guess a >>> couple of identity hashing functions will come under scrutiny. >>> >>> e.g. >>> >>> SmallInteger>>hashMultiply >>> | low | >>> >>> low := self bitAnd: 16383. >>> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) >>> bitAnd: 16383) * 16384)) >>> bitAnd: 16r0FFFFFFF >>> >>> >>> which will need some more bits. >>> >> >> IMO it's not clear that SmallInteger>>identityHash should be implemented >> that way. Finding a permutation of the small integers that also behaves >> like a good quality hash function and evaluates quickly (in significantly >> less time and complexity than, say, Bob Jenkins' lookup3) would be a really >> interesting research project. I don't know if it's possible. If no such >> thing exists, then getting at least some rough idea of what's the minimum >> necessary complexity for such hash functions would be valuable. >> >> Looking at hashMultiply as a non-identity hash function, one would start >> having problems when significantly more than 2^28 objects are stored in a >> single hashed collection. 2^28 objects with e.g. 12 bytes per header and a >> minimum of one instance variable (so the hash value isn't a >> instance-constant) stored in a hashed collection requires more than 4gb, so >> that is clearly a 64 bit image problem. In 64 bits, 2^28 objects with e.g. >> 16 bytes per header and a minimum of one instance variable each is already >> 6gb, and a minimum of 8gb with the hashed collection itself. >> >> Because of these figures, I'd think improving the implementation of >> hashed collections takes priority over adding more non-identity hash >> function bits (as long as the existing hash values are of good quality). >> >> Did you look at the Hash Analysis Tool I wrote? It's in the Cincom >> public Store repository. It comes in two bundles: Hash Analysis Tool, and >> Hash Analysis Tool - Extensions. With everything loaded, the tool comes >> with 300+ hash functions and 100+ data sets out of the box. The code is >> MIT. >> >> I had a look at how it was done in VisualWorks; >>> >> >> The implementation of hashMultiply, yes. Note however that >> SmallInteger>>hash is ^self. >> >> hashMultiply >>> "Multiply the receiver by 16r0019660D mod 2^28 >>> without using large integer arithmetic for speed. >>> The constant is a generator of the multiplicative >>> subgroup of Z_2^30, see Knuth's TAOCP vol 2." >>> <primitive: 1747> >>> | low14Bits | >>> low14Bits := self bitAnd: 16r3FFF. >>> ^16384 >>> * (16r260D * (self bitShift: -14) + (16r0065 * low14Bits) bitAnd: >>> 16r3FFF) >>> + (16r260D * low14Bits) bitAnd: 16rFFFFFFF >>> >>> The hashing book version has: >>> >>> multiplication >>> "Computes self times 1664525 mod 2^38 while avoiding overflow into a >>> large integer by making the multiplication into two 14 bits chunks. Do >>> not use any division or modulo operation." >>> | lowBits highBits| >>> >>> lowBits := self bitAnd: 16r3FFF. >>> highBits := self bitShift: -14. >>> ^(lowBits * 16r260D) >>> + (((lowBits * 16r0065) bitAnd: 16r3FFF) bitShift: 14) >>> + (((highBits * 16r260D) bitAnd: 16r3FFF) bitShift: 14) >>> bitAnd: 16rFFFFFFF >>> >>> So, 16384 * is the same as bitShift: 14 and it looks like done once, >>> which may be better. >>> >> >> It should be a primitive (or otherwise optimized somehow). At some point >> though that hash function was implemented for e.g. ByteArray in Squeak, I >> thought at that point the multiplication step was also implemented as a >> primitive? >> >> Also VW marks it as a primitive, which Pharo does not. >>> >> >> In VW it is also a translated primitive, i.e. it's executed directly in >> the JIT without calling C. >> >> Keep in mind that the speed at which hash values are calculated is only >> part of the story. If the hash function quality is not great, or the >> hashed collection implementation is not efficient and induces collisions or >> other extra work, improving the efficiency of the hash functions won't do >> much. I think it's mentioned in the hash book (I'd have to check), but >> once I made a hash function 5x times slower to get better quality and the >> result was that a report that was taking 30 minutes took 90 seconds instead >> (and hashing was gone from the profiler output). >> >> Would we gain >>> some speed doing that? hashMultiply is used a lof for identity hashes. >>> >>> Bytecode has quite some work to do: >>> >>> 37 <70> self >>> 38 <20> pushConstant: 16383 >>> 39 <BE> send: bitAnd: >>> 40 <68> popIntoTemp: 0 >>> 41 <21> pushConstant: 9741 >>> 42 <10> pushTemp: 0 >>> 43 <B8> send: * >>> 44 <21> pushConstant: 9741 >>> 45 <70> self >>> 46 <22> pushConstant: -14 >>> 47 <BC> send: bitShift: >>> 48 <B8> send: * >>> 49 <23> pushConstant: 101 >>> 50 <10> pushTemp: 0 >>> 51 <B8> send: * >>> 52 <B0> send: + >>> 53 <20> pushConstant: 16383 >>> 54 <BE> send: bitAnd: >>> 55 <24> pushConstant: 16384 >>> 56 <B8> send: * >>> 57 <B0> send: + >>> 58 <25> pushConstant: 268435455 >>> 59 <BE> send: bitAnd: >>> 60 <7C> returnTop >>> >> >> If this is a primitive instead, then you can also avoid the overflow into >> large integers and do the math with (basically) >> >> mov eax, smallInteger >> shr eax, numberOfTagBits >> mul eax, 1664525 "the multiplication that throws out the high bits" >> shl eax, 4 "throw out bits 29-32" >> shr eax, 4 >> lea eax, [eax * 2^numberOfTagBits + smallIntegerTagBits] >> >> Please excuse trivial omissions in the above, it's written only for the >> sake of illustration (e.g. it looks like the 3 last instructions can be >> combined into two... lea followed by shr). Also, did you see the latency >> of integer multiplication instructions in modern x86 processors?... >> >> I ran some experiments timing things. >>> >>> It looks like that replacing 16384 * by bitShift:14 leads to a small >>> gain, bitShift (primitive 17) being faster than * (primitive 9) >>> >> >> Keep in mind those operations still have to check for overflow into large >> integers. In this case, large integers are not necessary. >> >> Andres. >> >> The bytecode is identical, except send: bitShift instead of send: * >>> >>> multiplication3 >>> | low | >>> >>> low := self bitAnd: 16383. >>> ^(16r260D * low + ((16r260D * (self bitShift: -14) + (16r0065 * low) >>> bitAnd: 16383) bitShift: 14)) >>> bitAnd: 16r0FFFFFFF >>> >>> >>> >>> [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun 12 >>> [500000 timesRepeat: [ 15000 multiplication ]] timeToRun 41 (worse) >>> [500000 timesRepeat: [ 15000 multiplication3 ]] timeToRun 10 (better) >>> >>> It looks like correct for SmallInteger minVal to: SmallInteger maxVal >>> >>> Now, VW gives: [500000 timesRepeat: [ 15000 hashMultiply ]] timeToRun >>> 1.149 milliseconds >>> >>> Definitely worth investigating the primitive thing, or some NB Asm as >>> this is used about everywhere (Collections etc). >>> >>> Toughts? >>> >>> Phil >>> >> >> >
