On Tue, Jun 5, 2012 at 1:07 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> Dag Sverre Seljebotn, 05.06.2012 00:07:
>> The C FAQ says 'if you know the contents of your hash table up front
>> you can devise a perfect hash', but no details, probably just
>> hand-waving.
>>
>> 128 bits gives more entropy for perfect hashing: some but not much
>> since each shift r is hardwired to one 64 bit subset.
>
> Perfect hashing can be done with any fixed size data set (although it's
> not guaranteed to always be the most efficient solution). It doesn't
> matter if you use 64 bits or 128 bits. If 4 bits is enough, go with
> that. The advantage of perfect hashing of a fixed size data set is that
> the hash table has no free space and a match is guaranteed to be exact.
The hash function is f(h(sig)), where f is parameterized but must be
*extremely* cheap and h is fixed without regard to the entry set. This is
why having 128 bits for the output of h may be an advantage.

> However, the problem in this specific case is that the caller and the
> callee do not agree on the same set of entries, so there may be
> collisions during the lookup (of a potentially very large set of
> signatures) that were not anticipated in the perfect hash table layout
> (of the much smaller set of provided signatures). Perfect hashing works
> here as well, but it loses one of its main advantages over other
> hashing schemes. You then have to compare the entries exactly after the
> lookup in order to make sure that you didn't run into a collision, thus
> losing time again that you just won with the hashing.
>
> But at least you only have to do exactly one such comparison, so that's
> an advantage over a hashing scheme that also allows collisions in the
> layout. Maybe you can even handle mismatches more quickly by adding a
> dedicated "empty" entry that most (all?) anticipated mismatches would
> hash to.

The idea is that the comparison would be cheap, a single 128-bit compare.
The whole point is to avoid branching in the success case.

I agree with you that 64-bit collisions are too high a risk. One could
re-introduce the encoding/interning if desired, but I think we're safe in
assuming no accidental md5 collisions. (I hadn't thought much about the
malicious case; if you're allowed to dictate function pointers you'd
better have another line of defense. Perhaps this needs to be considered
more.) We could even use sha1, though I thought the previous benchmarks
indicated that comparing 160 bits was non-negligibly more expensive than
comparing just 64.

- Robert
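
To make the f(h(sig)) scheme concrete, here is a minimal sketch in C. The
names and table layout are hypothetical (not the actual CEP code): h(sig)
is a 128-bit hash of the signature string, precomputed at compile time; f
is the callee's cheap parameterized hash, here a single shift-and-mask
whose shift is chosen so that the callee's small, fixed entry set is
collision-free; the caller verifies a hit with one 128-bit compare and
only branches to a slow path on a mismatch.

#include <stdint.h>
#include <stddef.h>

typedef void (*sig_func)(void);  /* stand-in for the specialized C function */

typedef struct {
    uint64_t id[2];   /* 128-bit signature hash, e.g. md5 of the sig string */
    sig_func fn;      /* function pointer for that signature                */
} sig_entry;

typedef struct {
    unsigned shift;   /* parameters of f, chosen per table so that the      */
    uint64_t mask;    /* provided entries land in distinct slots            */
    const sig_entry *entries;   /* 2**k slots                               */
} sig_table;

/* Caller side: one shift, one mask, one index, one 128-bit compare.
 * A hit is guaranteed exact; a miss falls back to a slow path. */
static sig_func
lookup_signature(const sig_table *t, const uint64_t id[2])
{
    const sig_entry *e = &t->entries[(id[0] >> t->shift) & t->mask];
    if (e->id[0] == id[0] && e->id[1] == id[1])
        return e->fn;
    return NULL;   /* not provided (or unanticipated collision): slow path */
}

/* Callee side (done once, when the table is laid out): brute-force a shift
 * so that the small, fixed set of provided signature ids is collision-free.
 * Assumes mask <= 63 so a 64-bit bitmap can track occupied slots. */
static int
find_shift(const uint64_t (*ids)[2], size_t n, uint64_t mask)
{
    for (unsigned shift = 0; shift < 64; shift++) {
        uint64_t seen = 0;
        size_t i;
        for (i = 0; i < n; i++) {
            uint64_t slot = (ids[i][0] >> shift) & mask;
            if (seen & (1ULL << slot))
                break;              /* collision, try the next shift */
            seen |= 1ULL << slot;
        }
        if (i == n)
            return (int)shift;      /* perfect for this entry set */
    }
    return -1;                      /* no single shift works here */
}

The shift search is the "perfect hashing of a fixed size data set" step
from the quoted discussion; as noted there, each shift r only draws on one
64-bit word of the id, so if no single shift separates the entries a
slightly richer (but still branch-free) f would be needed.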