On 06/05/2012 10:50 PM, Robert Bradshaw wrote:
On Tue, Jun 5, 2012 at 1:10 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
On 06/04/2012 11:43 PM, Robert Bradshaw wrote:
On Mon, Jun 4, 2012 at 1:55 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
On 06/04/2012 09:44 PM, Dag Sverre Seljebotn wrote:
Robert and I had a long discussion on the NumFOCUS list about this
already, but I figured it was better to continue it and provide more
in-depth benchmark results here.
It's basically a new idea of how to provide a vtable based on perfect
hashing, which should be a lot simpler to implement than what I first
imagined.
I'll write down some context first; if you're familiar with this,
skip ahead a bit.
This means that you can do fast dispatches *without* the messy
business of binding vtable slots at compile time. To be concrete, this
might e.g. take the form
def f(obj):
    obj.method(3.4)  # try to find a vtable with "void method(double)" in it
or, a more typed approach,
# File A
cdef class MyImpl:
    def double method(double x): return x * x

# File B
# Here we never know about MyImpl, hence "duck-typed"
@cython.interface
class MyIntf:
    def double method(double x): pass

def f(MyIntf obj):
    # obj *can* be a MyImpl instance, or whatever else supports
    # that interface
    obj.method(3.4)
Now, the idea to implement this is:
a) Both caller and callee pre-hash the name/argument string
"mymethod:iidd" to 64 bits of hash data (probably lower 64 bits of
md5)
b) Callee (MyImpl) generates a vtable of its methods by *perfect*
hashing. What you do is define a final hash fh as a function
of the pre-hash ph, for instance
fh = ((ph >> vtable.r1) ^ (ph >> vtable.r2) ^ (ph >> vtable.r3)) & vtable.m
(Robert and I are benchmarking different functions to use here.) By
playing with r1, r2, r3, you have 64**3 choices of hash function, and
will be able to pick a combination which gives *no* (or very few)
collisions.
c) Caller then combines the pre-hash generated at compile-time, with
r1, r2, r3, m stored in the vtable header, in order to find the
final location in the hash-table.
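To make b) and c) concrete, here is a minimal C sketch of what the table
layout and the caller-side lookup could look like. The struct and field
names are illustrative assumptions, not the actual benchmark code; the
pre-hash 'ph' is assumed to be the lower 64 bits of md5 of the
name/signature string, emitted as a compile-time constant.

    #include <stdint.h>

    /* Illustrative layout only; names are assumptions, not the real code. */
    typedef struct {
        uint64_t id;        /* lower 64 bits of md5 of "name:signature" */
        void    *funcptr;   /* the method implementation */
    } hashvtable_entry_t;

    typedef struct {
        uint8_t  r1, r2, r3;           /* shift amounts chosen by the callee */
        uint64_t m;                    /* mask, table size minus one */
        hashvtable_entry_t entries[];  /* the hash table slots */
    } hashvtable_t;

    /* Caller side (step c): 'ph' is the pre-hash of e.g. "mymethod:iidd",
       known at compile time and ideally emitted as a 64-bit immediate. */
    static inline void *
    lookup_method(const hashvtable_t *vtable, uint64_t ph)
    {
        uint64_t fh = ((ph >> vtable->r1) ^ (ph >> vtable->r2)
                       ^ (ph >> vtable->r3)) & vtable->m;
        const hashvtable_entry_t *e = &vtable->entries[fh];
        /* Compare the stored id so a non-matching slot is never called;
           NULL here means "fall back to the slow path / AttributeError". */
        return (e->id == ph) ? e->funcptr : NULL;
    }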
The exciting thing is that in benchmarks, the performance penalty over
a C++-style vtable is actually very slight. (Of course you can
cache a proper vtable, but the fact that you get so close without
caring about caching means that this can be done much faster.)
One advantage about caching a vtable is that one can possibly put in
adapters for non-exact matches. It also opens up the possibility of
putting in stubs to call def methods if they exist. This needs to be
fleshed out more (another CEP :), but could provide for an easy,
backwards-compatible first implementation.
Back to Robert's and my discussion of the benchmarks:
I've uploaded benchmarks here:
https://github.com/dagss/hashvtable/tree/master/dispatchbench
I've changed how the benchmark numbers are taken to give more robust
results (at least for me); you want to look at the 'min' column.
I changed the benchmark a bit so that it benchmarks a *callsite*:
we don't pass 'h' on the stack, but either a) look it up in a global
variable (the default), or b) make it a compile-time constant (an
immediate in assembly; compile with -DIMHASH).
Similarly, the ID is either an "interned" global variable, or an
immediate (-DIMID).
The results are very different on my machine depending on this aspect.
My conclusions:
- All of three shifts with masking, two shifts with a "fallback slot"
(allowing for a single collision), three shifts, and two shifts with
two masks allow for constructing good vtables. In the case of only
two shifts, one colliding method gets the twoshift+fback
performance and the rest get the twoshift performance.
- Performance is really more affected by whether hashes are
immediates or global variables than the hash function. This is in
contrast to the interning vs. key benchmarks -- so I think that if
we looked up the vtable through PyTypeObject, rather than getting
the vtable directly, the loads of the global variables could
potentially be masked by that.
- My conclusion: Just use the lower bits of md5 *both* for the hashing
and the ID-ing (don't bother with any interning), and compile the
thing as a 64-bit immediate. This can cause crashes/stack smashes
etc. if there are lower-64-bits-of-md5 collisions, but a) the
probability is incredibly small, b) it would only matter in
situations that should cause an AttributeError anyway, c) if we
really care, we can always use an interning-like mechanism to
validate on module loading that a module's hashes don't collide with
other hashes (and raise an exception: "Congratulations, you've
discovered a phenomenal md5 collision, get in touch with the Cython
devs and we'll work around it right away").
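A sketch of how that 64-bit value could be produced, assuming OpenSSL's
MD5 is available to whatever build step or code generator emits the
constant (which 8 digest bytes count as "lower", and their byte order,
is an assumption that caller and callee simply have to share):

    #include <stdint.h>
    #include <string.h>
    #include <openssl/md5.h>   /* assumes OpenSSL is available to the generator */

    /* Lower 64 bits of md5("name:signature"), used both as the pre-hash
       and as the ID stored in the table. */
    static uint64_t
    prehash_signature(const char *sig)
    {
        unsigned char digest[MD5_DIGEST_LENGTH];
        uint64_t h;
        MD5((const unsigned char *)sig, strlen(sig), digest);
        memcpy(&h, digest, sizeof h);   /* first 8 bytes, native byte order */
        return h;
    }

    /* e.g. prehash_signature("mymethod:iidd") written into a generated
       header as a 64-bit immediate constant. */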
Due to the birthday paradox, this seems a bit risky. Maybe it's
because I regularly work with collections much bigger than 2^32, and I
suppose we're talking about unique method names and signatures here,
but still... I wonder what the penalty would be for checking the full
128 bit hash. (Storing it could allow for greater entropy in the
optimal hash table search as well).
Wonder no more. Here's the penalty for different bit-lengths, all
compile-time constants:
threeshift:    min=6.08e-09  mean=6.11e-09  std=2.81e-11  val=1200000000.000000
threeshift96:  min=7.53e-09  mean=7.55e-09  std=1.96e-11  val=1200000000.000000
threeshift128: min=6.95e-09  mean=6.97e-09  std=2.57e-11  val=1200000000.000000
threeshift160: min=8.17e-09  mean=8.23e-09  std=4.06e-11  val=1200000000.000000
And for comparison, when loading the comparison IDs from a global variable:
threeshift:    min=6.46e-09  mean=6.52e-09  std=4.95e-11  val=1200000000.000000
threeshift96:  min=8.07e-09  mean=8.16e-09  std=4.55e-11  val=1200000000.000000
threeshift128: min=8.06e-09  mean=8.18e-09  std=6.71e-11  val=1200000000.000000
threeshift160: min=9.71e-09  mean=9.83e-09  std=5.12e-11  val=1200000000.000000
So indeed,
64-bit hash < interning < 128-bit hash
(At least on my Intel Nehalem Core i7 at 1.87 GHz.)
And the load of the global variable may in real life be hidden by other
things going on in the function.
And, you save vtable memory by having an interned char* and not saving the
hash in the vtable.
I'm OK with using the 64-bit hash with a macro to enable further
checking. If it becomes an issue, we can partition the vtable into two
separate structures (hash64/pointer/flags? + hash160/char*/metadata).
That's probably overkill. With an eye to security, perhaps the spec
should be sha1 (or sha2?, not sure if that ships with Python).
No, I like splitting up the table; I was assuming we'd stick the char*
in a different table anyway. Cache is precious, and the second table
would be completely cold in most situations.
Is the goal then to avoid having to have an interning registry?
Something that hasn't come up so far is that Cython doesn't know the
exact types of external typedefs, so it can't generate the hash at
Cythonize-time. I guess some support for build systems to probe for type
sizes and compute the signature hashes in a separate header file would
solve this -- with a fallback to computing them at runtime during module
loading, if you're not using a supported build system. (But suddenly an
interning registry doesn't look so horrible..)
Really, I think a micro-benchmark is rather pessimistic about the
performance of loading a global variable -- if more stuff happens around
the call site then the load will likely be moved ahead and the latency
hidden. Perhaps this might even be the case just for going the route
through extensibletypeobject.
They should be made more easily runnable so that we could run them on
various systems, but it makes sense to first read up on and figure out which
hash functions are really viable, to keep the number of numbers down.
I just realized that I never pushed the changes I did to introduce
-DIMHASH/-DIMID etc., but the benchmarks are pushed now.
We could also do a fallback table. Usually it'd be empty; occasionally
it'd have one element in it. It'd always be possible to make this big
enough to avoid collisions in a worst-case scenario.
If you do a fallback table it's as much code in the call site as linear
probing...
Is linear probing that bad? It's an extra increment and compare in the
miss case.
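For scale, a sketch of what linear probing would add at the call site,
reusing the illustrative layout from the earlier sketch (again an
assumption, not the benchmarked code):

    /* The hit case is identical to the no-probing lookup; the miss case
       costs one extra increment and compare per probed slot.  Assumes the
       table is never 100% full, so the loop always terminates. */
    static inline void *
    lookup_method_probing(const hashvtable_t *vtable, uint64_t ph)
    {
        uint64_t fh = ((ph >> vtable->r1) ^ (ph >> vtable->r2)
                       ^ (ph >> vtable->r3)) & vtable->m;
        while (vtable->entries[fh].id != ph) {
            if (vtable->entries[fh].funcptr == NULL)
                return NULL;               /* empty slot: no such method */
            fh = (fh + 1) & vtable->m;     /* probe the next slot */
        }
        return vtable->entries[fh].funcptr;
    }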
But when I played with the generation side, a failure to create a table at a
given size would *always* be due to a single collision. This is what I did
in the twoshift+fback benchmark.
But it won't always be. One can always increase the size of the main
table, however, if two collisions are rare enough.
Yes of course, I didn't test 100% fill of a 64-entry table. I was more
concerned with making the table 128 or 256 rather than having to go to
512 :-)
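For reference, a minimal sketch of the generation-side search those fill
rates refer to: brute force over the 64**3 choices of (r1, r2, r3) at a
fixed power-of-two table size, retrying with a doubled size on failure
(an illustration only, not the actual generator):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Try to place the n pre-hashes 'ph' collision-free into a table of
       'size' slots (size a power of two, assumed <= 4096 here).  Returns
       true and fills r1/r2/r3 on success; on failure the caller can
       double 'size' and retry. */
    static bool
    find_perfect_hash(const uint64_t *ph, int n, uint64_t size,
                      uint8_t *r1, uint8_t *r2, uint8_t *r3)
    {
        uint64_t m = size - 1;
        for (int a = 0; a < 64; a++)
          for (int b = 0; b < 64; b++)
            for (int c = 0; c < 64; c++) {
                bool used[4096];
                bool ok = true;
                memset(used, 0, size * sizeof used[0]);
                for (int i = 0; i < n && ok; i++) {
                    uint64_t fh = ((ph[i] >> a) ^ (ph[i] >> b) ^ (ph[i] >> c)) & m;
                    if (used[fh]) ok = false;
                    else used[fh] = true;
                }
                if (ok) { *r1 = (uint8_t)a; *r2 = (uint8_t)b; *r3 = (uint8_t)c; return true; }
            }
        return false;
    }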
Duplicate tables work as long as there aren't too many orthogonal
considerations. Is the GIL the only one? What about "I can propagate
errors"? Now we're up to 4 tables...
Would your decision of whether or not to dispatch to a function depend on
whether or not it propagates errors?
I'm thinking of the "with gil" function case, i.e. callee has:
a) Function to call if you have the GIL
b) GIL-acquiring wrapper
and you want GIL-holding code to call a) and nogil code to call b).
But one could just make the caller acquire the GIL if needed (which in that
case is so expensive anyway that it can be made the unlikely() path).
Are you saying you'd add code to the call site to determine if it
needs (and conditionally acquire) the GIL?
Well, I'm saying it's an alternative, I'm not sure if it has merit.
Basically shift the "with gil" responsibility to the caller in this case.
I can't think of other situations where you would pick which function to
call based on flags.
If the caller doesn't propagate errors, it may want to have different
codepaths depending on whether the callee propagates them.
Not sure if I understand. Would you call a *different* incarnation of
the callee depending on this, and need different function pointers for
different callers?
Otherwise you just check flags after the call and take the appropriate
action, with a likely() around the likely one. You need flags, but not a
different table.
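A sketch of what that single-table variant could look like; the flag
name and the likely() spelling are assumptions:

    #include <stdint.h>

    #define likely(x)  __builtin_expect(!!(x), 1)

    /* Illustrative flag stored alongside the function pointer. */
    #define METH_PROPAGATES_ERRORS  0x1

    typedef double (*dmethod_t)(void *self, double x);

    /* One table, no per-caller variants: flags are checked after the
       call, with the common case marked likely(). */
    static double
    call_checked(void *self, dmethod_t f, uint64_t flags)
    {
        double result = f(self, 3.4);
        if (likely(!(flags & METH_PROPAGATES_ERRORS)))
            return result;              /* common case: nothing to check */
        /* slow path: see whether the callee raised, and handle/propagate it */
        return result;
    }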
Dag