On Jul 11, 3:01 pm, [EMAIL PROTECTED] wrote:
> I have found this perfect hash (minimal too) implementation:
> http://burtleburtle.net/bob/hash/perfect.html
>
> I have already translated part of it to D, and it seems to work well
> enough. As discussed in the PyConDue, I think this may be used in
> frozenset (and frozendict) to build a (minimal too?) perfect hash on
> the fly, to allow (hopefully) faster retrieval of items that don't
> change.
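For context, a minimal perfect hash maps n known keys onto exactly n slots with no collisions. Here is a toy sketch of the idea in Python, using a simple "hash and displace" construction -- this is an illustration only, not the algorithm from the linked page:

```python
# Toy minimal perfect hash via "hash and displace" (illustration only,
# NOT Bob Jenkins' construction). Assumes the keys are unique.

def build_mph(keys):
    """Map len(keys) unique keys onto slots 0..n-1 with no collisions."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:                       # first-level hash picks a bucket
        buckets[hash(k) % n].append(k)
    displace = [0] * n
    slots = [None] * n
    # Place the largest buckets first; for each bucket, search for a
    # displacement d that sends every key in it to a distinct free slot.
    for b in sorted(range(n), key=lambda i: -len(buckets[i])):
        if not buckets[b]:
            continue
        d = 0
        while True:
            pos = [hash((d, k)) % n for k in buckets[b]]
            if len(set(pos)) == len(pos) and all(slots[p] is None for p in pos):
                break
            d += 1
        displace[b] = d
        for k, p in zip(buckets[b], pos):
            slots[p] = k
    return displace, slots

def mph_slot(displace, slots, key):
    """Collision-free slot for a key that was in the original key set."""
    n = len(slots)
    d = displace[hash(key) % n]
    return hash((d, key)) % n
```

Note that even this toy version has to search for displacements at build time -- the construction cost that the reply below weighs against lookup savings.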
A few thoughts:

Objects currently have their own hash function that is independent of any given container. We also have to hash types other than strings (the examples in the links you provided use only strings). Also, while a container such as a dict or set is being built, we don't even know how many unique keys there are until the container is fully populated.

Set-to-set operations and set-to-dict operations get their speed from knowing that both containers use the same hash values -- this would get disrupted if each container had its own hash function.

Some types, like strings (especially interned strings), remember their own hash value -- this makes it very fast to look up their values in a set or dict. That would be defeated if each container had its own hash function, which would need to be recomputed for every lookup.

An important use case for sets is uniquification. In that use case, speed is determined by insertion time; we just don't care about subsequent lookup time. Anything that slows down insertion (such as computing a perfect hash value) works to the detriment of that use case.

In the current design, sets and dicts are never more than 2/3 full and are usually much more sparse than that -- accordingly, lookups average between 1 and 1.5 probes per lookup. This is somewhat hard to beat with perfect hashing since we typically get very few collisions anyway. So you get the cost of increased insertion time, loss of objects being able to remember their own hash, challenges with non-string keys, a more complicated algorithm, and decreased interoperability for set-to-set and set-to-dict operations -- and the only benefit is reducing to zero collisions that are already uncommon.

For frozensets, it may be possible to add an optimize() method that takes a completely built frozenset and optimizes its insertion order and/or increases its size to make it arbitrarily sparse.
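The sparseness claim is easy to check empirically. Here is a small simulation (a sketch, using the probe recurrence published in CPython's dict/set source, with random integers standing in for real keys):

```python
import random

def average_probes(n_keys, table_size):
    """Average probes per successful lookup in an open-addressing table,
    using the same probe recurrence as CPython's dicts and sets:
        j = (5*j + 1 + perturb) & mask;  perturb >>= 5
    table_size must be a power of two, as in CPython.
    """
    assert table_size & (table_size - 1) == 0
    mask = table_size - 1
    table = [None] * table_size

    def probes(h):
        j, perturb = h & mask, h
        while True:
            yield j
            perturb >>= 5
            j = (5 * j + 1 + perturb) & mask

    keys = random.sample(range(10**9), n_keys)
    for k in keys:                          # build the table
        for j in probes(k):
            if table[j] is None:
                table[j] = k
                break
    total = 0
    for k in keys:                          # look every key up again
        for i, j in enumerate(probes(k), 1):
            if table[j] == k:
                total += i
                break
    return total / n_keys

# A table 1/3 full (the typical density right after a resize), e.g.
# average_probes(683, 2048), comes in well under 1.5 probes per lookup.
```

With so few probes on average, a perfect hash can only shave a fraction of a probe off each lookup, which is the crux of the cost/benefit argument above.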
That would only be useful in cases where the cost of optimizing the table is fully repaid by an extensive number of lookups. There are some use cases where it would pay to accept table optimization costs in order to win decreased lookup costs, but there are plenty of use cases where that would not be a winning trade-off. So the optimize() step would need to be optional, not built in.

Raymond

FWIW, I mentioned all this at PyConDue, but the message must have gotten lost.

--
http://mail.python.org/mailman/listinfo/python-list