On Tue, Jun 5, 2012 at 2:41 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
> On 06/05/2012 10:50 PM, Robert Bradshaw wrote:
>> On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn
>> <d.s.seljeb...@astro.uio.no> wrote:
>>> On 06/04/2012 11:43 PM, Robert Bradshaw wrote:
>>>> On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
>>>> <d.s.seljeb...@astro.uio.no> wrote:
>>>>> On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
>>>>>>
>>>>>> Robert and I had a long discussion on the NumFOCUS list about this
>>>>>> already, but I figured it was better to continue it and provide more
>>>>>> in-depth benchmark results here.
>>>>>>
>>>>>> It's basically a new idea of how to provide a vtable based on perfect
>>>>>> hashing, which should be a lot simpler to implement than what I first
>>>>>> imagined.
>>>>>>
>>>>>> I'll write down some context first; if you're familiar with this,
>>>>>> skip ahead a bit.
>>>>>>
>>>>>> This means that you can do fast dispatches *without* the messy
>>>>>> business of binding vtable slots at compile time. To be concrete,
>>>>>> this might e.g. take the form
>>>>>>
>>>>>>     def f(obj):
>>>>>>         obj.method(3.4)  # try to find a vtable with "void method(double)" in it
>>>>>>
>>>>>> or, a more typed approach,
>>>>>>
>>>>>>     # File A
>>>>>>     cdef class MyImpl:
>>>>>>         def double method(double x): return x * x
>>>>>>
>>>>>>     # File B
>>>>>>     # Here we never know about MyImpl, hence "duck-typed"
>>>>>>     @cython.interface
>>>>>>     class MyIntf:
>>>>>>         def double method(double x): pass
>>>>>>
>>>>>>     def f(MyIntf obj):
>>>>>>         # obj *can* be a MyImpl instance, or whatever else
>>>>>>         # supports that interface
>>>>>>         obj.method(3.4)
>>>>>>
>>>>>> Now, the idea to implement this is:
>>>>>>
>>>>>> a) Both caller and callee pre-hash the name/argument string
>>>>>> "mymethod:iidd" to 64 bits of hash data (probably the lower 64 bits
>>>>>> of md5).
>>>>>>
>>>>>> b) The callee (MyImpl) generates a vtable of its methods by *perfect*
>>>>>> hashing. What you do is define a final hash fh as a function of the
>>>>>> pre-hash ph, for instance
>>>>>>
>>>>>>     fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m
>>>>>>
>>>>>> (Robert and I are benchmarking different functions to use here.) By
>>>>>> playing with r1, r2, r3, you have 64**3 choices of hash function, and
>>>>>> will be able to pick a combination which gives *no* (or very few)
>>>>>> collisions.
>>>>>>
>>>>>> c) The caller then combines the pre-hash generated at compile time
>>>>>> with r1, r2, r3, m stored in the vtable header, in order to find the
>>>>>> final location in the hash table.
>>>>>>
>>>>>> The exciting thing is that in benchmarks, the performance penalty is
>>>>>> actually very slight over a C++-style vtable. (Of course you can
>>>>>> cache a proper vtable, but the fact that you get so close without
>>>>>> caring about caching means that this can be done much faster.)
>>>>
>>>> One advantage of caching a vtable is that one can possibly put in
>>>> adapters for non-exact matches.
>>>> It also opens up the possibility of
>>>> putting in stubs to call def methods if they exist. This needs to be
>>>> fleshed out more (another CEP :), but could provide for a
>>>> backwards-compatible easy first implementation.
>>>>
>>>>>> Back to my and Robert's discussion on benchmarks:
>>>>>>
>>>>>> I've uploaded benchmarks here:
>>>>>>
>>>>>> https://github.com/dagss/hashvtable/tree/master/dispatchbench
>>>>>>
>>>>>> I've changed the benchmark-taking to give more robust numbers (at
>>>>>> least for me); you want to look at the 'min' column.
>>>>>>
>>>>>> I changed the benchmark a bit so that it benchmarks a *callsite*.
>>>>>> So we don't pass 'h' on the stack, but either a) look it up in a
>>>>>> global variable (the default), or b) make it a compile-time constant
>>>>>> (an immediate in assembly; compile with -DIMHASH).
>>>>>>
>>>>>> Similarly, the ID is either an "interned" global variable, or an
>>>>>> immediate (-DIMID).
>>>>>>
>>>>>> The results are very different on my machine depending on this
>>>>>> aspect. My conclusions:
>>>>>>
>>>>>>  - Three shifts with masking, two shifts with a "fallback slot"
>>>>>>    (allowing for a single collision), plain three shifts, and two
>>>>>>    shifts with two masks all allow for constructing good vtables.
>>>>>>    In the case of only two shifts, one colliding method gets the
>>>>>>    twoshift+fback performance and the rest get the twoshift
>>>>>>    performance.
>>>>>>
>>>>>>  - Performance is really more affected by whether hashes are
>>>>>>    immediates or global variables than by the hash function. This is
>>>>>>    in contrast to the interning vs. key benchmarks -- so I think that
>>>>>>    if we looked up the vtable through PyTypeObject, rather than
>>>>>>    getting the vtable directly, the loads of the global variables
>>>>>>    could potentially be masked by that.
>>>>>>
>>>>>>  - My conclusion: Just use the lower bits of md5 *both* for the
>>>>>>    hashing and the ID-ing (don't bother with any interning), and
>>>>>>    compile the thing as a 64-bit immediate.
>>>>>>    This can cause crashes/stack smashes
>>>>>>    etc. if there's a lower-64-bits-of-md5 collision, but a) the
>>>>>>    probability is incredibly small, b) it would only matter in
>>>>>>    situations that should cause an AttributeError anyway, and c) if
>>>>>>    we really care, we can always use an interning-like mechanism to
>>>>>>    validate on module loading that a module's hashes don't collide
>>>>>>    with other hashes (and raise an exception: "Congratulations,
>>>>>>    you've discovered a phenomenal md5 collision, get in touch with
>>>>>>    the cython devs and we'll work around it right away").
>>>>
>>>> Due to the birthday paradox, this seems a bit risky. Maybe it's
>>>> because I regularly work with collections much bigger than 2^32, and I
>>>> suppose we're talking about unique method names and signatures here,
>>>> but still... I wonder what the penalty would be for checking the full
>>>> 128-bit hash. (Storing it could allow for greater entropy in the
>>>> optimal hash table search as well.)
>>>
>>> Wonder no more.
>>> Here's the penalty for different bit-lengths, all
>>> compile-time constants:
>>>
>>>   threeshift:    min=6.08e-09 mean=6.11e-09 std=2.81e-11 val=1200000000.000000
>>>   threeshift96:  min=7.53e-09 mean=7.55e-09 std=1.96e-11 val=1200000000.000000
>>>   threeshift128: min=6.95e-09 mean=6.97e-09 std=2.57e-11 val=1200000000.000000
>>>   threeshift160: min=8.17e-09 mean=8.23e-09 std=4.06e-11 val=1200000000.000000
>>>
>>> And for comparison, when loading the comparison IDs from a global
>>> variable:
>>>
>>>   threeshift:    min=6.46e-09 mean=6.52e-09 std=4.95e-11 val=1200000000.000000
>>>   threeshift96:  min=8.07e-09 mean=8.16e-09 std=4.55e-11 val=1200000000.000000
>>>   threeshift128: min=8.06e-09 mean=8.18e-09 std=6.71e-11 val=1200000000.000000
>>>   threeshift160: min=9.71e-09 mean=9.83e-09 std=5.12e-11 val=1200000000.000000
>>>
>>> So indeed,
>>>
>>>   64-bit hash < interning < 128-bit hash
>>>
>>> (At least on my Intel Nehalem Core i7 1.87 GHz.)
>>>
>>> And the load of the global variable may in real life be hidden by
>>> other things going on in the function.
>>>
>>> And you save vtable memory by having an interned char* and not saving
>>> the hash in the vtable.
>>
>> I'm OK with using the 64-bit hash with a macro to enable further
>> checking. If it becomes an issue, we can partition the vtable into two
>> separate structures (hash64/pointer/flags? + hash160/char*/metadata).
>> That's probably overkill. With an eye to security, perhaps the spec
>> should be sha1 (or sha2? not sure if that ships with Python).
>
> No, I like splitting up the table; I was assuming we'd stick the char*
> in a different table anyway. Cache is precious, and the second table
> would be completely cold in most situations.
>
> Is the goal then to avoid having to have an interning registry?
Yes, and to avoid invoking an expensive hash function at runtime in
order to achieve good distribution.

> Something that hasn't come up so far is that Cython doesn't know the
> exact types of external typedefs, so it can't generate the hash at
> Cythonize-time. I guess some support for build systems to probe for
> type sizes and compute the signature hashes in a separate header file
> would solve this -- with a fallback to computing them at runtime at
> module loading, if you're not using a supported build system. (But
> suddenly an interning registry doesn't look so horrible...)

It all depends on how strict you want to be. It may be acceptable to
let f(int) and f(long) not hash to the same value even if sizeof(int)
== sizeof(long). We could also promote all int types to long or long
long, including extern types (assuming, with a C-compile-time check,
that external types declared up to "long" are <= sizeof(long)). Another
option is to let the hash be md5(sig) + hashN(sizeof(extern_arg1), ...,
sizeof(extern_argN)) where hashN is a macro.

> Really, I think a micro-benchmark is rather pessimistic about the
> performance of loading a global variable -- if more stuff happens
> around the call site then the load will likely be moved ahead and the
> latency hidden. Perhaps this might even be the case just for going the
> route through extensibletypeobject.
>
>>> They should be made more easily runnable so that we could run them on
>>> various systems, but it makes sense to first read up on and figure
>>> out which hash functions are really viable, to keep the number of
>>> numbers down.
>>>
>>> I just realized that I never pushed the changes I did to introduce
>>> -DIMHASH/-DIMID etc., but the benchmarks are pushed now.
>>>
>>>> We could also do a fallback table. Usually it'd be empty; occasionally
>>>> it'd have one element in it. It'd always be possible to make this big
>>>> enough to avoid collisions in a worst-case scenario.
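[The fallback idea quoted above, and the "twoshift+fback" variant from the benchmarks, can be modeled in Python. This is an illustrative sketch, not the real C layout: the main table holds (id, func) pairs indexed by a two-shift hash, and a single fallback slot absorbs one collision at the cost of one extra compare.]

    def twoshift_dispatch(table, fallback, r1, r2, m, ph):
        # table: list of (id_hash, func) or None; fallback: one extra slot
        # absorbing a single collision. ph: the caller's precomputed
        # 64-bit pre-hash of "name:signature".
        fh = ((ph >> r1) ^ (ph >> r2)) & m
        entry = table[fh]
        if entry is not None and entry[0] == ph:
            return entry[1]                # fast path ("twoshift")
        if fallback is not None and fallback[0] == ph:
            return fallback[1]             # one extra compare ("twoshift+fback")
        raise AttributeError("no method with this id in the vtable")

    # Toy ids 8 and 13 collide in a 4-slot table under r1=0, r2=2:
    # (8 ^ (8 >> 2)) & 3 == (13 ^ (13 >> 2)) & 3 == 2.
    table = [None] * 4
    table[2] = (8, "impl_a")
    fallback = (13, "impl_b")
    assert twoshift_dispatch(table, fallback, 0, 2, 3, 8) == "impl_a"
    assert twoshift_dispatch(table, fallback, 0, 2, 3, 13) == "impl_b"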
>>> If you do a fallback table, it's as much code at the call site as
>>> linear probing...
>>
>> Is linear probing that bad? It's an extra increment and compare in the
>> miss case.
>>
>>> But when I played with the generation side, a failure to create a
>>> table at a given size would *always* be due to a single collision.
>>> This is what I did in the twoshift+fback benchmark.
>>
>> But it won't always be. One can always increase the size of the main
>> table, however, if two collisions are rare enough.
>
> Yes, of course; I didn't test 100% fill of a 64-entry table. I was
> more concerned with making the table 128 or 256 rather than having to
> go to 512 :-)
>
>>>> Duplicate tables work as long as there aren't too many orthogonal
>>>> considerations. Is the GIL the only one? What about "I can propagate
>>>> errors"? Now we're up to 4 tables...
>>>
>>> Would your decision of whether or not to dispatch to a function
>>> depend on whether or not it propagates errors?
>>>
>>> I'm thinking of the "with gil" function case, i.e. the callee has:
>>>
>>> a) a function to call if you have the GIL
>>> b) a GIL-acquiring wrapper
>>>
>>> and you want GIL-holding code to call a) and nogil code to call b).
>>>
>>> But one could just make the caller acquire the GIL if needed (which
>>> in that case is so expensive anyway that it can be made the
>>> unlikely() path).
>>
>> Are you saying you'd add code to the call site to determine if it
>> needs (and conditionally acquire) the GIL?
>
> Well, I'm saying it's an alternative; I'm not sure if it has merit.
> Basically, shift the "with gil" responsibility to the caller in this
> case.
>
>>> I can't think of other situations where you would pick which function
>>> to call based on flags.
>>
>> If the caller doesn't propagate errors, it may want to have different
>> code paths depending on whether the callee propagates them.
>
> Not sure if I understand.
> Would you call a *different* incarnation of the callee depending on
> this, and need different function pointers for different callers?
>
> Otherwise you just check the flags after the call and take the
> appropriate action, with a likely() around the likely one. You need
> flags, but not a different table.

Fair enough.

- Robert
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel
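[To tie the thread together, here is an illustrative end-to-end Python model of the scheme under discussion: signatures are pre-hashed to the lower 64 bits of their md5 (one possible reading of "lower 64 bits" is taken here), the callee searches (r1, r2, r3) for a collision-free three-shift table, and the caller combines its compile-time pre-hash with the header parameters to dispatch. All names and the table layout are hypothetical; the real implementation would be generated C.]

    import hashlib
    import itertools

    def prehash(sig):
        # Lower 64 bits of md5("name:argcodes"), as proposed in the thread
        # (here: first 8 digest bytes, big-endian -- an assumption).
        return int.from_bytes(hashlib.md5(sig.encode()).digest()[:8], "big")

    def build_vtable(methods, size=8):
        # methods: {"name:sig": func}. Search shift triples (r1, r2, r3)
        # until the three-shift hash maps every pre-hash to a distinct slot.
        m = size - 1
        ids = {prehash(sig): func for sig, func in methods.items()}
        for r1, r2, r3 in itertools.product(range(64), repeat=3):
            slots = {((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m for h in ids}
            if len(slots) == len(ids):
                table = [None] * size
                for h, func in ids.items():
                    table[((h >> r1) ^ (h >> r2) ^ (h >> r3)) & m] = (h, func)
                return (r1, r2, r3, m, table)
        raise ValueError("no perfect hash found; grow the table")

    def dispatch(vtable, ph):
        # Caller side: combine the compile-time pre-hash with the header
        # params, then compare the stored 64-bit id before trusting the slot.
        r1, r2, r3, m, table = vtable
        entry = table[((ph >> r1) ^ (ph >> r2) ^ (ph >> r3)) & m]
        if entry is None or entry[0] != ph:
            raise AttributeError("signature not in vtable")
        return entry[1]

    vt = build_vtable({"method:d": lambda x: x * x,
                       "other:ii": lambda a, b: a + b})
    assert dispatch(vt, prehash("method:d"))(3.0) == 9.0
    assert dispatch(vt, prehash("other:ii"))(2, 3) == 5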