Robert Bradshaw <rober...@gmail.com> wrote:
>On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
><d.s.seljeb...@astro.uio.no> wrote:
>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>
>>> Robert and I had a long discussion on the NumFOCUS list about this
>>> already, but I figured it was better to continue it and provide more
>>> in-depth benchmark results here.
>>>
>>> It's basically a new idea of how to provide a vtable based on perfect
>>> hashing, which should be a lot simpler to implement than what I first
>>> imagined.
>>>
>>> I'll write down some context first; if you're familiar with this,
>>> skip ahead a bit.
>>>
>>> The idea is that you can do fast dispatches *without* the messy
>>> business of binding vtable slots at compile time. To be concrete, this
>>> might e.g. take the form
>>>
>>> def f(obj):
>>>     obj.method(3.4)  # try to find a vtable with "void method(double)" in it
>>>
>>> or, a more typed approach,
>>>
>>> # File A
>>> cdef class MyImpl:
>>>     def double method(double x): return x * x
>>>
>>> # File B
>>> # Here we never know about MyImpl, hence "duck-typed"
>>> @cython.interface
>>> class MyIntf:
>>>     def double method(double x): pass
>>>
>>> def f(MyIntf obj):
>>>     # obj *can* be a MyImpl instance, or whatever else
>>>     # supports that interface
>>>     obj.method(3.4)
>>>
>>> Now, the idea to implement this is:
>>>
>>> a) Both caller and callee pre-hash the name/argument string
>>> "mymethod:iidd" to 64 bits of hash data (probably the lower 64 bits
>>> of its md5).
>>>
>>> b) The callee (MyImpl) generates a vtable of its methods by *perfect*
>>> hashing. What you do is define a final hash fh as a function of the
>>> pre-hash ph, for instance
>>>
>>>     fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m
>>>
>>> (Robert and I are benchmarking different functions to use here.) By
>>> playing with r1, r2, r3 you have 64**3 choices of hash function, and
>>> will be able to pick a combination which gives *no* (or very few)
>>> collisions.
>>>
>>> c) The caller then combines the pre-hash generated at compile time
>>> with r1, r2, r3, m stored in the vtable header, in order to find the
>>> final location in the hash table.
>>>
>>> The exciting thing is that in the benchmarks, the performance penalty
>>> is actually very slight compared to a C++-style vtable. (Of course you
>>> can cache a proper vtable, but the fact that you get so close without
>>> caring about caching means that this can be done much faster.)
>
>One advantage of caching a vtable is that one can possibly put in
>adapters for non-exact matches. It also opens up the possibility of
>putting in stubs to call def methods if they exist. This needs to be
>fleshed out more (another CEP :), but could provide for a
>backwards-compatible easy first implementation.
>
>>> Back to my and Robert's discussion on benchmarks:
>>>
>>> I've uploaded benchmarks here:
>>>
>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>
>>> I've changed the benchmark-taking to give more robust numbers (at
>>> least for me); you want to look at the 'min' column.
>>>
>>> I also changed the benchmark a bit so that it benchmarks a *call
>>> site*: we don't pass 'h' on the stack, but either a) look it up in a
>>> global variable (the default), or b) make it a compile-time constant
>>> (an immediate in assembly; compile with -DIMHASH).
>>>
>>> Similarly, the ID is either an "interned" global variable or an
>>> immediate (-DIMID).
>>>
>>> The results are very different on my machine depending on this aspect.
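To make step b) concrete, here is a rough Python model of how the
callee could search for collision-free parameters. This is only a
sketch -- the names and the brute-force strategy are illustrative, not
code from the benchmark repo:

    from itertools import product

    def find_shifts(prehashes, table_size):
        # table_size must be a power of two; m is then the bit mask.
        m = table_size - 1
        for r1, r2, r3 in product(range(64), repeat=3):
            slots = set(((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
                        for ph in prehashes)
            if len(slots) == len(prehashes):
                # No two methods share a slot: a perfect hash.
                return r1, r2, r3
        return None  # no luck; grow the table and retry

(This is also why probing is so much cheaper with two shifts than with
three: the search space shrinks from 64**3 to 64**2 combinations.)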
>>> My conclusions:
>>>
>>> - Three shifts with masking, two shifts with a "fallback slot"
>>> (allowing for a single collision), three shifts alone, and two shifts
>>> with two masks all allow for constructing good vtables. In the case
>>> of only two shifts, the one colliding method gets the twoshift+fback
>>> performance and the rest get the twoshift performance.
>>>
>>> - Performance is affected more by whether the hashes are immediates
>>> or global variables than by the hash function. This is in contrast to
>>> the interning-vs.-key benchmarks -- so I think that if we looked up
>>> the vtable through PyTypeObject, rather than getting the vtable
>>> directly, the loads of the global variables could potentially be
>>> masked by that.
>>>
>>> - My conclusion: just use the lower bits of md5 *both* for the
>>> hashing and the ID-ing (don't bother with any interning), and compile
>>> the thing in as a 64-bit immediate. This can cause crashes/stack
>>> smashes etc. if there are lower-64-bits-of-md5 collisions, but a) the
>>> probability is incredibly small, b) it would only matter in
>>> situations that should cause an AttributeError anyway, and c) if we
>>> really care, we can always use an interning-like mechanism to
>>> validate on module loading that a module's hashes don't collide with
>>> other hashes (and raise an exception: "Congratulations, you've
>>> discovered a phenomenal md5 collision, get in touch with the cython
>>> devs and we'll work around it right away").
>
>Due to the birthday paradox, this seems a bit risky. Maybe it's
>because I regularly work with collections much bigger than 2^32, and I
>suppose we're talking about unique method names and signatures here,
>but still... I wonder what the penalty would be for checking the full
>128-bit hash. (Storing it could also allow for greater entropy in the
>search for an optimal hash table.)
>
>> What I forgot to mention:
>>
>> - I really want to avoid linear probing, just because of the code
>> bloat in call sites.
>
>That's a good point. What about flags -- are we throwing out the idea
>of masking?
>
>> With two shifts, when there was a failure to find a perfect hash,
>> it was always possible to find one with a single collision.
>>
>> - Probing for the hash with two shifts is lightning fast; it can take
>> a while with three shifts (though you can always spend more memory on
>> a bigger table to make it fast again). However, it makes me uneasy to
>> penalize the performance of calling one (random) method, so I'm really
>> in favour of three shifts or double-mask (to be decided when
>> investigating the performance of probing for parameters in more
>> detail).
>>
>> - I tried using SSE to do the shifts in parallel and failed (miserable
>> performance). The problem is quickly moving things between
>> general-purpose registers and SSE registers, and the lack of SSE
>> immediates/constants in the instruction stream. At least, what my gcc
>> 4.6 generated appeared to use the stack to communicate between SSE
>> registers and general-purpose registers (but I can't have been doing
>> the right thing...).
>
>>> The RTTI (i.e. the char*) is also put in there, but it is not used
>>> for comparison and is not interned.
>>>
>>> At least, that's what I think we should do for duck-style vtables.
>>>
>>> Do we then go to all the pain of defining key-encoding, interning
>>> etc. just for SEP 201? Isn't it easier to just mandate an md5
>>> dependency and be done with it? (After all, md5 usually comes with
>>> Python in the md5 and hashlib modules.)
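The pre-hash itself would then be a couple of lines with hashlib. A
sketch -- which 8 of the 16 digest bytes count as the "lower" 64 bits,
and the exact encoding of the signature string, are placeholders here:

    import hashlib
    import struct

    def prehash(signature):
        # signature is the name/argument string, e.g. "mymethod:iidd".
        digest = hashlib.md5(signature.encode('ascii')).digest()
        # Interpret 8 of the 16 digest bytes as an unsigned 64-bit
        # integer; this value doubles as the ID checked at the call site.
        return struct.unpack('<Q', digest[:8])[0]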
>>> direct:          early binding
>>> index:           call slot 0 (C++-style vtable/function pointer)
>>> noshift:         h & m1
>>> oneshift:        (h >> r1) & m1
>>> twoshift:        ((h >> r1) ^ (h >> r2)) & m1
>>> twoshift+fback:  hash doesn't
>>
>> I meant: hash collision and then, after a branch miss, look up the
>> one fallback slot in the vtable header.
>
>We could also do a fallback table. Usually it'd be empty; occasionally
>it'd have one element in it. It'd always be possible to make this big
>enough to avoid collisions in a worst-case scenario.
>
>BTW, this is a general static char* -> void* dictionary; I bet it
>could have other uses. (It may also be a well-studied problem, though
>a bit hard to search for...) I suppose we could reduce it to
>read-optimized int -> int mappings.

The C FAQ says "if you know the contents of your hash table up front
you can devise a perfect hash", but gives no details -- probably just
hand-waving.

128 bits gives more entropy for perfect hashing -- some, but not much,
since each shift r is hardwired to one 64-bit subset. Judging from the
interning/key benchmarks, checking the full 128 bits would probably
not be noticeable in microbenchmarks; it's more about using an extra
register and bloating the instruction cache and data cache a bit etc.,
stuff that can only be measured in production.

The alternative is having a collision-detection registry. If it
complains, you're told where to edit Cython (perhaps a data file) so
that the pre-hash function changes:

    if signature == 'foo:ddffi':  # known collision with 'bar:ii'
        use the high 64 bits of the md5
    else:
        use the low 64 bits of the md5

Each such collision would be documented in the CEP/SEP. But using 128
bits and then relying on luck is perhaps simpler...

If we need flags, let's say that 92 bits suffice for the hash and use
16 for flags... But I was thinking that you'd have separate tables for
nogil callers and GIL-holding callers, so that you wouldn't need to
scan for matching flags. We really want this to be branch-miss-free.
Still, flags are good for error return codes etc.

Do you agree on forgetting about the encoded keys/interning even for
SEP 201? There's only so much effort to go around, and I'd much rather
use md5 and these hash tables everywhere.

Dag
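P.S. For clarity, here is the twoshift+fback lookup modelled in
Python. All names are illustrative, and the real call site would of
course be a few inlined C instructions rather than a function call:

    from collections import namedtuple

    # One vtable slot: the 64-bit pre-hash doubles as the method ID.
    Entry = namedtuple('Entry', ['id', 'funcptr'])

    # r1, r2: shift amounts; m: mask; table: list of Entry (or None)
    # of length m + 1; fallback: the single extra slot in the header.
    VTable = namedtuple('VTable', ['r1', 'r2', 'm', 'table', 'fallback'])

    def lookup(vtable, ph):
        fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2)) & vtable.m
        entry = vtable.table[fh]
        if entry is not None and entry.id == ph:
            return entry.funcptr  # the hot, well-predicted path
        # Branch miss: the one colliding method sits in the fallback slot.
        fb = vtable.fallback
        if fb is not None and fb.id == ph:
            return fb.funcptr
        raise AttributeError("no method with this signature in the vtable")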