> -----Original Message-----
> From: Dmitry Stogov [mailto:dmi...@zend.com]
> Sent: Friday, April 29, 2016 10:52 AM
> To: Matt Wilmas <php_li...@realplain.com>; internals@lists.php.net
> Cc: Nikita Popov <nikita....@gmail.com>; Xinchen Hui <larue...@php.net>
> Subject: [PHP-DEV] Re: New HashTable implementation?
> 
> Hi Matt,
> 
> I also tried different hash functions (including murmur3) and came to a
> similar result: a faster function, but more hash collisions.
> Actually, these collisions occur not because of different hash values, but
> because of the small hash array size.
> The way we transform the hash value into a hash array index (using the low
> bits) may also matter.
> If we decrease the hash load factor, we get fewer collisions, but we use
> more memory, and this also leads to an increase in cache misses.
> 
> You are welcome to try Robin Hood; in case you get positive results we may
> switch to it (however, I'm sceptical).
> 
> Thanks. Dmitry.

Hi Matt,
Just my two cents after playing around a little bit with the hash tables and 
functions as well :) ...
As Dmitry stated in different words, you can have a hash function of great 
quality, but you will still get a lot of collisions, because only a reduced 
number of bits of the hash value is used for the lookup in small hash tables. 
Sometimes browsing collisions will be cheaper than computing good hash values.

But browsing collisions is not always cheap, I think. Another experiment I 
did, based on the observation that a key is frequently looked up 2-3 
consecutive times, was a simple LRU management of the colliding bucket lists; 
the overhead was quite small (just refreshing the list on a collision where 
the searched bucket was not in the first position), but it was big enough to 
produce overall worse results.

What we see at the microarchitecture level on IA when running PHP 7 on 
WordPress, for example, is that branch mispredictions and ICACHE misses are 
very high; overall we can talk about a kind of BPU (branch prediction unit) 
and ICACHE saturation. Adding complexity to the implementation, meaning 
bigger code and/or more conditional branches, will have a negative impact 
unless there is an important gain in executing fewer instructions.

I write this just in the hope of giving some hints on possible gains/losses 
in your attempts. :-)
I will not be surprised if a simple, crappy hash function together with a 
cheap collision solver leads to better overall speed than more sophisticated, 
theoretically better approaches.

Hope it helps,
Bogdan
> 
> ________________________________________
> From: Matt Wilmas <php_li...@realplain.com>
> Sent: Friday, April 29, 2016 3:04:44 AM
> To: internals@lists.php.net
> Cc: Dmitry Stogov; Nikita Popov; Xinchen Hui
> Subject: New HashTable implementation?
> 
> Hi all,
> 
> Last June, it was briefly mentioned about changing PHP's string hash function
> [1] (DJB33 *seems* pretty horrible, after all, as far as collisions...).  So 8
> months ago I tried almost, if not, a half-dozen of them (including Murmur3)
> that were simple enough to quickly toss in.
> 
> The result?  In all cases, I believe, fewer instructions (in zend_hash_find, 
> and
> the hashing function), BUT also a slight-to-small increase in cache misses in
> Wordpress and other scripts...
> 
> And in a test filling an array with a million "string_$i" keys (high collision
> pattern for DJB33?), the speed was halved by the *huge* cache miss
> increase. :-/
> 
> I couldn't quite understand what was happening, where, if there were fewer
> collisions...  Misses all spread out in the hash-array?
> 
> So there didn't seem to be anything useful to gain there.
> 
> 
> Now, after seeing Bogdan's hash optimization idea last month [2], and
> reading Nikita's blog post again, I had some ideas I'd like to try -- assuming
> nobody else is planning major changes. :-)  Besides Nikita, I'm addressing
> Dmitry and Xinchen because your names are on some minor hash items on
> the 7.1 ideas wiki [4].
> 
> I'm thinking of a Robin Hood implementation with Universal hashing [5] (of
> int keys and the string hashes).  I haven't touched any code yet, but think 
> I've
> worked out all the details in my head, and am ready to take a stab at it.  I
> think it's fairly simple to get the basics working; and when I see how that
> goes, I can do the additional optimizations I have in mind that it allows
> (including reduced memory, on 64-bit at least).
> 
> Question: Can I use zval.u1.v.reserved ?  I guess I'll find out otherwise.
> :-O
> 
> 
> The string hash function itself is really a separate thing [6], but fasthash 
> [7]
> (not to be confused with "superfast") looks like a good one that I missed
> before...  After thinking about things, I think we could even keep/use a 
> 64-bit
> hash on 32-bit arch.
> 
> 
> Well, just wanted to mention it first if anyone has a comment. :-)  Should be
> interesting, but no idea how it'll perform (lookups should be very, very fast
> (upsizing also); but cache factors and inserts/deletes are wildcards).
> Wish me luck!?
> 
> 
> Thanks,
> Matt
> 
> [1] https://marc.info/?l=php-internals&m=143444845304138&w=2
> [2] https://marc.info/?t=145744248100001&r=1&w=2
> [4] https://wiki.php.net/php-7.1-ideas
> [5] https://en.wikipedia.org/wiki/Universal_hashing
> [6] https://github.com/rurban/smhasher
> [7] https://github.com/rurban/smhasher/blob/master/doc/fasthash64
> 
> 
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: http://www.php.net/unsub.php


