On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
> On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote:
>> On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
>>> On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
>>>> On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>> On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
>>>>>> On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>>>> On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
>>>>>>>> On 06/05/2012 12:30 AM, Robert Bradshaw wrote:
>>>>>>>>> I just found http://cmph.sourceforge.net/ which looks quite interesting. Though the resulting hash functions are supposedly cheap, I have the feeling that branching is considered cheap in this context.
>>>>>>>>
>>>>>>>> Actually, this lead was *very* promising. I believe the very first reference I actually read through (and didn't eliminate after the abstract) totally swept away our home-grown solutions!
>>>>>>>>
>>>>>>>> "Hash & Displace" by Pagh (1999) is actually very simple, easy to understand, and fast both for generation and (the branch-free) lookup:
>>>>>>>>
>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf
>>>>>>>>
>>>>>>>> The idea is:
>>>>>>>>
>>>>>>>> - Find a hash `g(x)` to partition the keys into `b` groups (the paper requires b > 2n, though I think in practice you can often get away with less)
>>>>>>>>
>>>>>>>> - Find a hash `f(x)` such that f is 1:1 within each group (which is easily achieved since groups only have a few elements)
>>>>>>>>
>>>>>>>> - For each group, from largest to smallest: Find a displacement `d[group]` so that `f(x) ^ d` doesn't cause collisions.
>>>>>>>>
>>>>>>>> It requires extra storage for the displacement table. However, I think 8 bits per element might suffice even for vtables of 512 or 1024 in size. Even with 16 bits it's rather negligible compared to the minimum-128-bit entries of the table.
>>>>>>>>
>>>>>>>> I benchmarked these hash functions:
>>>>>>>>
>>>>>>>> displace1: ((h >> r1) ^ d[h & 63]) & m1
>>>>>>>> displace2: ((h >> r1) ^ d[h & m2]) & m1
>>>>>>>> displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1
>>>>>>>>
>>>>>>>> Only the third one is truly in the spirit of the algorithm, but I think the first two should work well too (and when h is known at compile time, looking up d[h & 63] isn't harder than looking up r1 or m1).
>>>>>>>>
>>>>>>>> My computer is acting up and all my numbers today are slower than the earlier ones (yes, I disabled turbo-mode in the BIOS a year ago, and yes, I've pinned the CPU speed).
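
To make sure we're comparing the same thing, here's roughly how I picture the displace2 variant in C -- just a sketch, with the struct layout, field names and fixed sizes all invented here rather than taken from the paper or the prototype:

    /* Sketch only: a possible table layout plus the branch-free displace2
       lookup.  All names are made up; array sizes are fixed just to keep
       the example self-contained. */
    #include <stdint.h>

    typedef struct {
        uint64_t r;         /* shift */
        uint64_t m1;        /* mask into funcs[], i.e. 2^k - 1 */
        uint64_t m2;        /* mask into d[], i.e. |d| - 1 */
        uint16_t d[64];     /* displacements */
        void *funcs[64];    /* the vtable entries themselves */
    } hashvtable_t;

    static void *hashvtable_lookup(const hashvtable_t *t, uint64_t h)
    {
        /* displace2: ((h >> r) ^ d[h & m2]) & m1 -- no branches */
        return t->funcs[((h >> t->r) ^ t->d[h & t->m2]) & t->m1];
    }

(For displace1 the d[] index is a compile-time constant whenever h is, which is the case mentioned above; displace2 needs the m2 load first, so it is a bit surprising that it doesn't show up in the timings.)
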
>>>>>>>> But here's today's numbers, compiled with -DIMHASH:
>>>>>>>>
>>>>>>>> direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=2400000000.000000
>>>>>>>> index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=1800000000.000000
>>>>>>>> twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=1800000000.000000
>>>>>>>> threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=1800000000.000000
>>>>>>>> displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=1800000000.000000
>>>>>>>> displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=1800000000.000000
>>>>>>>> displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=1800000000.000000
>>>>>>>>
>>>>>>>> I did a dirty prototype of the table-finder as well and it works:
>>>>>>>>
>>>>>>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>>>>>
>>>>>>> The paper obviously puts more effort on minimizing table size than on a fast lookup. My hunch is that our choice should be
>>>>>>>
>>>>>>> ((h >> table.r) ^ table.d[h & m2]) & m1
>>>>>>>
>>>>>>> and use 8-bit d (because even if you have 1024 methods, you'd rather double the number of bins than have those 2 extra bits available for displacement options).
>>>>>>>
>>>>>>> Then keep incrementing the size of d and the number of table slots (in such an order that the total vtable size is minimized) until success. In practice this should almost always just increase the size of d, and keep the table size at the lowest 2**k that fits the slots (even for 64 methods or 128 methods :-))
>>>>>>>
>>>>>>> Essentially we avoid the shift in the argument to d[] by making d larger.
>>>>>>
>>>>>> Nice. I'm surprised that the indirection on d doesn't cost us much;
>>>>>
>>>>> Well, table->d[const & const] compiles down to the same kind of code as table->m1. I guess I'm surprised too that displace2 doesn't penalize.
>>>>>
>>>>>> hopefully its size wouldn't be a big issue either. What kinds of densities were you achieving?
>>>
>>> OK, simulation results just in (for the displace2 hash), and they exceeded my expectations.
>>>
>>> I always fill the table with n=2^k keys, and fix b = n (b means |d|). Then the failure rates are (top two are 100,000 simulations, the rest are 1000 simulations):
>>>
>>> n=   8 b=   8 failure-rate=0.0019 try-mean=4.40  try-max=65
>>> n=  16 b=  16 failure-rate=0.0008 try-mean=5.02  try-max=65
>>> n=  32 b=  32 failure-rate=0.0000 try-mean=5.67  try-max=25
>>> n=  64 b=  64 failure-rate=0.0000 try-mean=6.60  try-max=29
>>> n= 128 b= 128 failure-rate=0.0000 try-mean=7.64  try-max=22
>>> n= 256 b= 256 failure-rate=0.0000 try-mean=8.66  try-max=37
>>> n= 512 b= 512 failure-rate=0.0000 try-mean=9.57  try-max=26
>>> n=1024 b=1024 failure-rate=0.0000 try-mean=10.66 try-max=34
>>>
>>> Try-mean and try-max are how many r's needed to be tried before success, so they give an indication of how much margin is left before failure.
>>>
>>> For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to let b=2*n (100,000 simulations):
>>>
>>> n=  8 b= 16 failure-rate=0.0001 try-mean=2.43 try-max=65
>>> n= 16 b= 32 failure-rate=0.0000 try-mean=3.40 try-max=65
>>>
>>> NOTE: The 512...2048 results were with 16-bit displacements; with 8-bit displacements they mostly failed.
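
For reference, here's roughly the shape I'd expect the C table-finder to take -- only a sketch: the function name and every parameter are invented, groups aren't processed largest-first as the paper does, and the grow-|d|-and-retry driver plus error handling are left out:

    /* Sketch of the displacement search (not a real implementation).
       hashes[] holds the n pre-hashed keys; t is the number of table slots
       and b the number of displacement entries, both powers of two.
       Returns the shift r that worked, or -1 so the caller can grow b
       (and/or t) and retry. */
    #include <stdint.h>
    #include <stdlib.h>

    static int find_table(const uint64_t *hashes, int n, int t, int b,
                          uint16_t *d, int *slot_of)
    {
        for (int r = 0; r < 64; r++) {              /* the "tries" counted above */
            char *taken = calloc((size_t)t, 1);     /* which slots are used */
            int ok = 1;
            for (int bin = 0; bin < b && ok; bin++) {
                int disp, chosen[64], nchosen;      /* groups are tiny in practice */
                for (disp = 0; disp < t; disp++) {  /* try displacements for this group */
                    int fits = 1;
                    nchosen = 0;
                    for (int i = 0; i < n && fits; i++) {
                        if ((int)(hashes[i] & (uint64_t)(b - 1)) != bin)
                            continue;               /* key not in this group */
                        int slot = (int)(((hashes[i] >> r) ^ (uint64_t)disp)
                                         & (uint64_t)(t - 1));
                        if (taken[slot] || nchosen == 64) { fits = 0; break; }
                        for (int j = 0; j < nchosen; j++)
                            if (chosen[j] == slot) { fits = 0; break; }
                        if (fits) chosen[nchosen++] = slot;
                    }
                    if (fits) break;                /* this displacement works */
                }
                if (disp == t) { ok = 0; break; }   /* no displacement fits */
                d[bin] = (uint16_t)disp;
                for (int i = 0; i < n; i++) {       /* commit the group's slots */
                    if ((int)(hashes[i] & (uint64_t)(b - 1)) != bin)
                        continue;
                    int slot = (int)(((hashes[i] >> r) ^ (uint64_t)d[bin])
                                     & (uint64_t)(t - 1));
                    taken[slot] = 1;
                    slot_of[i] = slot;
                }
            }
            free(taken);
            if (ok) return r;
        }
        return -1;
    }
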
>>> So we either need to make each element of d 16 bits, or, e.g., store 512 entries in a 1024-slot table (which succeeded most of the time with 8-bit displacements). I'm +1 on 16-bit displacements.
>>>
>>> The algorithm is rather fast and concise:
>>>
>>> https://github.com/dagss/hashvtable/blob/master/pagh99.py
>>>
>>>>> The algorithm is designed for 100% density in the table itself. (We can lift that to compensate for a small space of possible hash functions, I guess.)
>>>>>
>>>>> I haven't done proper simulations yet, but I just tried |vtable|=128, |d|=128 from the command line and I had 15 successes or so before the first failure. That's with a 100% density in the vtable itself! (And when it fails, you increase |d| to get your success.)
>>>>>
>>>>> The caveat is the space spent on d (it's small in comparison, but that's why this isn't too good to be true).
>>>>>
>>>>> A disadvantage might be that we may no longer have the opportunity to not make the table size a power of two (i.e. replace the mask with "if (likely(slot < n))"). I think for that to work one would need to replace the xor group with addition on Z_d.
>>>>>
>>>>>> Going back to the idea of linear probing on a cache miss, this has the advantage that one can write a brain-dead provider that sets m=0 and simply lists the methods instead of requiring a table optimizer. (Most tools, of course, would do the table optimization.) It also lets you get away with a "kind-of good" hash rather than requiring you to search until you find a (larger?) perfect one.
>>>>>
>>>>> Well, given that we can have 100% density, and generating the table is lightning fast, and the C code to generate the table is likely a 300-line utility... I'm not convinced.
>>>>
>>>> It goes from an extraordinarily simple spec (the table is, at minimum, a func[2^k] with a couple of extra zero fields, whose struct can be statically defined in the source by hand) to a, well, not complicated in the absolute sense, but much more so than the definition above. It is also variable-size, which makes allocating it globally/on a stack a pain (I suppose one can choose an upper bound for |d| and |vtable|).
>>>>
>>>> I'm playing devil's advocate a bit here; it's probably just a (minor) con, but worth noting at least.
>>>
>>> If you were willing to go the interning route, so that you didn't need to fill the table with md5 hashes anyway, I'd say you'd have a stronger point :-)
>>>
>>> Given the results above, static allocation can at least be solved in a way that is probably user-friendly enough:
>>>
>>> PyHashVTable_16_16 mytable;
>>>
>>> ...init () {
>>>     mytable.functions = { ... };
>>>     if (PyHashVTable_Ready((PyHashVTable*)&mytable, 16, 16) == -1) return -1;
>>> }
>>>
>>> Now, with chance ~1/1000, you're going to get an exception saying "Please try PyHashVTable_16_32". (And since that's deterministic given the function definitions, you always catch it at once.)
>>
>> PS. PyHashVTable_Ready would do the md5's and reorder the functions etc. as well.
>
> There's still the indirection through SEP 200 (extensibletype slots). We can get rid of that very easily by just making that table and the hash-vtable one and the same. (It could still either have interned string keys or ID keys depending on the least significant bit.)
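
Just so we're picturing the same entry layout, I read that as something like the following -- purely illustrative, all names invented here -- where the low bit of the key says whether it's an interned-string pointer or a plain ID (assuming interned strings are at least 2-byte aligned, so real pointers always have a zero low bit):

    /* Illustrative sketch of one table doubling as the SEP 200 slot table
       and the hash vtable, with a tagged key per entry. */
    #include <stdint.h>

    typedef struct {
        uintptr_t key;     /* even: pointer to an interned string; odd: (id << 1) | 1 */
        void     *value;   /* function pointer / slot payload */
    } combined_entry_t;

    #define KEY_IS_ID(k)        ((k) & 1)
    #define KEY_AS_ID(k)        ((k) >> 1)
    #define KEY_AS_INTERNED(k)  ((const char *)(k))   /* only valid if low bit is 0 */
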
Or we can even forgo the interning for this table, and give up on partitioning the space numerically and just use DNS-style prefixing, e.g. "org.cython.X" belongs to us. There is value in the double indirection if this (or any of the other) lookup tables are meant to be modified over time.

> To wrap up, I think this has grown in complexity beyond the "simple SEP spec". It's at the point where you don't really want to have several libraries implementing the same simple spec, but instead use the same implementation.
>
> But I think the advantages are simply too good to give up on.
>
> So I think a viable route forward is to forget the CEP/SEP/pre-PEP approach for now (which only works for semi-complicated ideas with simple implementations) and instead simply work more directly on a library. It would need to have a couple of different use modes:

I prefer an enhancement proposal with a spec over a library, even if only a single library gets used in practice. I still think it's simple enough. Basically, we have the "lookup spec" and then a CEP for applying this to fast callable (agreeing on signatures, and what to do with extern types) and extensible type slots.

> - A Python perfect-hasher for use when generating code, with only a string interner based on CPython dicts and the extensibletype metaclass as runtime dependencies (for use in Cython). This would only add some hundred source file lines...
>
> - A C implementation of the perfect hashing exposed through a PyPerfectHashTable_Ready(), for use in libraries written in C (like NumPy/SciPy). This would need to bundle the md5 algorithm and a C implementation of the perfect hashing.
>
> And on the distribution axis:
>
> - Small C header-style implementation of a string interner and the extensibletype metaclass, rendezvousing through sys.modules
>
> - As part of the rendezvous, one would always try to __import__ the *real* run-time library. So if it is available in sys.path it overrides anything bundled with other libraries. That would provide a way forward for GIL-less string interning, a Python-side library for working with these tables and inspecting them, etc.

Hmm, that's an interesting idea. I think we don't actually need interning, in which case the "full" library is only needed for introspection.

> Time to stop talking and start coding...
>
> Dag

_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel