The best hashing performance requires deep understanding of the virtual memory address translation mechanism (aka translation lookaside buffer "TLB"). This is because the translation of virtual memory addresses to real memory addresses _already uses a builtin hardware hash table_. You might as well take advantage of this hardware help when accessing large hash tables.
http://en.wikipedia.org/wiki/Translation_lookaside_buffer http://en.wikipedia.org/wiki/Virtual_memory Hashing performance is also dependent upon the cache line size. It is sometimes possible to get better performance by using several levels of tables, if you can be assured that the top levels are essentially always in the cache and in the TLB. Also, overflows can be very fast _so long as the overflow scan occurs within the same cache line_. I presume that there are academic papers that address (!) this particular subject, but if not, a couple of days worth of experiments might be very educational. Common Lisp's SETF mechanism for installing items into a hash table has always been at odds with the best possible performance for hash tables in which a large % of the items will _never_ be accessed -- i.e., a _write-mostly_ hash table. The reason is that what you really want is an insert-if-not-already-there function which performs only one hashing operation. Also, Common Lisp severely crippled their built-in hash tables by not allowing full generality with a user-specified equality operator. As a result, I typically use Common Lisp hash tables for prototyping, but almost invariably have to write my own routines for the highest possible performance. It would also behoove Common Lisp implementors to allow the _inlining_ of hash table lookup & store operations. This would remove one more performance penalty when using the CL standard hash table operations. At 08:25 AM 11/14/2013, Camm Maguire wrote: >Greetings! This is just a note for posterity on some of the recent >learnings regarding hashing in GCL. Except for symbol lookups in >packages, gcl uses 'open addressing' hash tables, as is customary for a >one word stored 'value', i.e. a lisp object pointer. With this >strategy, collisions result in a linear probe of the table until either >an empty slot or the desired key is found. The negative performance >effects of the collision therefore occur not only for keys hashing to >the same index, but for those hashing to *adjacent* positions. Using a >cons address directly as an index will thus suffer, as conses are >frequently allocated in bulk in a contiguous region of memory. > >The current solution relies on our 'universal' hash function, i.e. a 256 >wide table of random 64bit numbers indirected by bytes of the key and >xored together, to not only distribute the hashed items uniformly >throughout the table, but also to minimize the possibility of >consecutive indices. I think these criteria should be fulfilled at >present, though there are explicit algorithmic solutions to the >consecutive index problem which do not impinge on the hash function. >These require more memory and complexity, and seem unwarranted at >present. > >What we were seeing with the recent performance degradation was a strong >dependence on the data layout as represented by the addresses in use. >As an historical aside we should note that GCL did randomize these >addresses in the past, but it was found to be 'too good' in the sense >that there were apparently ideal memory layouts which could be indexed >faster if the addresses were used directly and the randomization >skipped. I believe this discussion was in an ACL2 context some years >ago. This is just to note that we are now 'reverting that reversion', >and opting for a good average performance which (should be) independent >of data layout. > >Take care, >-- >Camm Maguire address@hidden >========================================================================== >"The earth is but one country, and mankind its citizens." -- Baha'u'llah _______________________________________________ Gcl-devel mailing list Gcl-devel@gnu.org https://lists.gnu.org/mailman/listinfo/gcl-devel