On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>
>> Robert and I had a long discussion on the NumFOCUS list about this
>> already, but I figured it was better to continue it and provide more
>> in-depth benchmark results here.
>>
>> It's basically a new idea for how to provide a vtable based on perfect
>> hashing, which should be a lot simpler to implement than what I first
>> imagined.
>>
>> I'll write down some context first; if you're familiar with this,
>> skip ahead a bit.
>>
>> This means that you can do fast dispatches *without* the messy
>> business of binding vtable slots at compile time. To be concrete, this
>> might e.g. take the form
>>
>> def f(obj):
>>     obj.method(3.4)  # try to find a vtable with "void method(double)" in it
>>
>> or, a more typed approach,
>>
>> # File A
>> cdef class MyImpl:
>>     def double method(double x): return x * x
>>
>> # File B
>> # Here we never know about MyImpl, hence "duck-typed"
>> @cython.interface
>> class MyIntf:
>>     def double method(double x): pass
>>
>> def f(MyIntf obj):
>>     # obj *can* be a MyImpl instance, or whatever else supports
>>     # that interface
>>     obj.method(3.4)
>>
>>
>> Now, the idea to implement this is:
>>
>> a) Both caller and callee pre-hash the name/argument string
>>    "mymethod:iidd" to 64 bits of hash data (probably the lower 64 bits
>>    of md5).
>>
>> b) The callee (MyImpl) generates a vtable of its methods by *perfect*
>>    hashing. What you do is define a final hash fh as a function
>>    of the pre-hash ph, for instance
>>
>>    fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m
>>
>>    (Robert and I are benchmarking different functions to use here.) By
>>    playing with r1, r2, r3, you have 64**3 choices of hash function, and
>>    will be able to pick a combination which gives *no* (or very few)
>>    collisions.
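[For readers following along, the parameter search in (b) can be sketched in Python. This is a minimal simulation, not the actual Cython implementation; the table size, the retry strategy, and the convention for taking "the lower 64 bits" are all assumptions:]

```python
import hashlib
import itertools

def prehash(signature):
    # 64 bits of the md5 of the name/argument string, as in (a).
    # (Taking the first 8 bytes little-endian is one convention; the
    # exact choice is a detail.)
    return int.from_bytes(hashlib.md5(signature.encode()).digest()[:8], "little")

def find_perfect_hash(prehashes, table_bits):
    # Try shift triples (r1, r2, r3) until the three-shift hash maps
    # every pre-hash to a distinct slot in a table of size 2**table_bits.
    m = (1 << table_bits) - 1
    for r1, r2, r3 in itertools.product(range(64), repeat=3):
        slots = {((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m for ph in prehashes}
        if len(slots) == len(prehashes):
            return r1, r2, r3, m
    return None  # no perfect triple found; grow the table and retry
```

[For method-table-sized sets the search space of 64**3 triples is almost always enough to find a collision-free combination; when it isn't, one can double the table or fall back to tolerating a single collision, as discussed further down.]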
>>
>> c) The caller then combines the pre-hash generated at compile time with
>>    r1, r2, r3, m stored in the vtable header, in order to find the
>>    final location in the hash table.
>>
>> The exciting thing is that in the benchmarks, the performance penalty
>> is actually very slight compared to a C++-style vtable. (Of course you
>> can cache a proper vtable, but the fact that you get so close without
>> caring about caching means that this can be done much faster.)
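[Steps (b) and (c) together can be sketched as follows — again a Python simulation under assumed names. In generated C the pre-hash would be a 64-bit immediate and the slot check a single compare-and-branch:]

```python
def build_vtable(methods, r1, r2, r3, m):
    # methods: {prehash -> function pointer}; r1/r2/r3/m come from the
    # perfect-hash search. The callee builds this once per type.
    table = [None] * (m + 1)
    for ph, fn in methods.items():
        slot = ((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m
        assert table[slot] is None, "not a perfect hash for this method set"
        table[slot] = (ph, fn)  # store the 64-bit ID alongside the pointer
    return {"r1": r1, "r2": r2, "r3": r3, "m": m, "table": table}

def call_site(vtable, ph):
    # The caller combines its compile-time pre-hash with the header fields.
    slot = (((ph >> vtable["r1"]) ^ (ph >> vtable["r2"])
             ^ (ph >> vtable["r3"])) & vtable["m"])
    entry = vtable["table"][slot]
    if entry is None or entry[0] != ph:
        raise AttributeError("vtable has no method with this signature")
    return entry[1]
```

[The ID comparison against the stored pre-hash is what makes a miss safe: an unrelated signature lands in some slot, fails the compare, and raises instead of calling through a wrong pointer.]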
One advantage of caching a vtable is that one can potentially insert
adapters for non-exact matches. It also opens up the possibility of
inserting stubs to call def methods if they exist. This needs to be
fleshed out more (another CEP :), but could provide for a
backwards-compatible, easy first implementation.

>> Back to my and Robert's discussion on benchmarks:
>>
>> I've uploaded benchmarks here:
>>
>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>
>> I've changed the benchmark-taking to give more robust numbers (at
>> least for me); you want to look at the 'min' column.
>>
>> I changed the benchmark a bit so that it benchmarks a *callsite*.
>> So we don't pass 'h' on the stack, but either a) look it up in a global
>> variable (the default), or b) make it a compile-time constant (an
>> immediate in assembly; compile with -DIMHASH).
>>
>> Similarly, the ID is either an "interned" global variable, or an
>> immediate (-DIMID).
>>
>> The results are very different on my machine depending on this aspect.
>> My conclusions:
>>
>> - Three shifts with masking, two shifts with a "fallback slot"
>>   (allowing for a single collision), three shifts, and two shifts with
>>   two masks all allow for constructing good vtables. In the case of
>>   only two shifts, one colliding method gets the twoshift+fback
>>   performance and the rest get the twoshift performance.
>>
>> - Performance is really more affected by whether hashes are
>>   immediates or global variables than by the hash function. This is in
>>   contrast to the interning vs. key benchmarks -- so I think that if
>>   we looked up the vtable through PyTypeObject, rather than getting
>>   the vtable directly, the loads of the global variables could
>>   potentially be masked by that.
>>
>> - My conclusion: just use the lower bits of md5 *both* for the hashing
>>   and the ID-ing (don't bother with any interning), and compile the
>>   thing as a 64-bit immediate. This can cause crashes/stack smashes
>>   etc.
>>   if there are lower-64-bits-of-md5 collisions, but a) the
>>   probability is incredibly small, b) it would only matter in
>>   situations that should cause an AttributeError anyway, and c) if we
>>   really care, we can always use an interning-like mechanism to
>>   validate on module loading that its hashes don't collide with
>>   other hashes (and raise an exception: "Congratulations, you've
>>   discovered a phenomenal md5 collision, get in touch with cython
>>   devs and we'll work around it right away").

Due to the birthday paradox, this seems a bit risky. Maybe it's because
I regularly work with collections much bigger than 2^32, and I suppose
we're talking about unique method names and signatures here, but
still... I wonder what the penalty would be for checking the full
128-bit hash. (Storing it could also allow for greater entropy in the
search for optimal hash-table parameters.)

> What I forgot to mention:
>
> - I really want to avoid linear probing just because of the code bloat
>   at call sites.

That's a good point. What about flags--are we throwing out the idea of
masking?

>   With two shifts, when there was a failure to find a perfect hash,
>   it was always possible to find one with a single collision.
>
> - Probing for the hash with two shifts is lightning fast; it can take a
>   while with three shifts (though you can always spend more memory on a
>   bigger table to make it fast again). However, it makes me uneasy to
>   penalize the performance of calling one of the random methods, so I'm
>   really in favour of three shifts or double-mask (to be decided when
>   investigating the performance of probing for parameters in more
>   detail).
>
> - I tried using SSE to do the shifts in parallel and failed (miserable
>   performance). The problem is quickly moving things between general
>   purpose registers and SSE registers, and the lack of SSE
>   immediates/constants in the instruction stream.
>   At least, what my gcc 4.6 generates appeared to use the stack to
>   communicate between SSE registers and general purpose registers (but
>   I can't have been doing the right thing...).

>>
>> The RTTI (i.e. the char*) is also put in there, but is not used for
>> comparison and is not interned.
>>
>> At least, that's what I think we should do for duck-style vtables.
>>
>> Do we then go to all the pain of defining key encoding, interning,
>> etc. just for SEP 201? Isn't it easier to just mandate an md5
>> dependency and be done with it? (After all, md5 usually comes with
>> Python in the md5 and hashlib modules.)
>>
>> direct:         Early binding
>> index:          Call slot 0 (C++-style vtable/function pointer)
>> noshift:        h & m1
>> oneshift:       (h >> r1) & m1
>> twoshift:       ((h >> r1) ^ (h >> r2)) & m1
>> twoshift+fback: hash doesn't
>
> I meant: Hash collision and then, after a branch miss, look up the one
> fallback slot in the vtable header.

We could also do a fallback table. Usually it'd be empty; occasionally
it'd have one element in it. It'd always be possible to make this big
enough to avoid collisions in a worst-case scenario.

BTW, this is a general static char* -> void* dictionary; I bet it could
have other uses. (It may also be a well-studied problem, though a bit
hard to search for...) I suppose we could reduce it to read-optimized
int -> int mappings.

- Robert
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
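[A fallback table along these lines could be sketched as follows. This is my Python extrapolation of the one-sentence idea above, with assumed names and a dict standing in for the vtable header; the fast path stays one compare and one branch, and the fallback is only scanned after a miss:]

```python
def build_vtable_with_fallback(methods, r1, r2, m):
    # Two-shift variant: methods that collide in the main table are
    # pushed into a (usually empty) fallback list in the header.
    table = [None] * (m + 1)
    fallback = []
    for ph, fn in methods.items():
        slot = ((ph >> r1) ^ (ph >> r2)) & m
        if table[slot] is None:
            table[slot] = (ph, fn)
        else:
            fallback.append((ph, fn))
    return {"r1": r1, "r2": r2, "m": m, "table": table, "fallback": fallback}

def lookup(vtable, ph):
    slot = ((ph >> vtable["r1"]) ^ (ph >> vtable["r2"])) & vtable["m"]
    entry = vtable["table"][slot]
    if entry is not None and entry[0] == ph:
        return entry[1]          # fast path: one compare, one branch
    for fph, fn in vtable["fallback"]:
        if fph == ph:            # after a branch miss: scan the tiny fallback
            return fn
    raise AttributeError("no matching method")
```

[Sized generously, the fallback absorbs any worst-case collisions, while the common case pays nothing for its existence.]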