>As has been discovered, Python hashtables are pretty good, so expect
>at most a 10x (?) improvement writing your own.
Just to tighten this claim up, I have a fairly optimal C string hash
table that is only about 2.8X faster (in a Cython environment) than
Python's (also in the same Cython environment), at least at L2 cache
and main memory scales. { My table is also about 3X faster than the
kind of lame one in g++-4 series STL, which means for strings anyway,
the Py one is "about as good as g++" - that may not be saying much,
but g++'s is one widely used, well-known alternative, anyway. }
For hashing, an intrinsically "non-local" sort of computation structure, the
actual time profile changes drastically once tables no longer fit in core RAM:
against a ~10ms disk seek the multiplier is more like 10ms/60ns =~ 166,000X,
or if you have 50us-latency SSD drives for swap, only 50us/60ns =~ 833X.
In memory usage, Python objects are huge. Robert points out one way
you may be able to save memory. On an x86-64, Python string objects
are typically >64 B, for example. So, if your keys are short 4-byte
strings and you keep them packed rather than as Python objects, you might
get something like a 16X reduction in memory, or be able to reach problem
sizes 16X larger before falling over the cliff of external-memory slow-downs.
So, either a very C-like Cython or a full C solution could allow you
to do problem sizes 16X bigger or so. (A ~10X memory optimization is
one of the few cases lately where I've had to go to a full C solution
instead of a Cython solution in my own work.) When everything does
fit in the same level of the memory hierarchy, about 3X is likely the
best improvement you will see over Python dictionaries.
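If you want to see the per-object overhead concretely, here is a quick,
hedged sketch (sys.getsizeof numbers vary by Python version and build; the
4-byte keys below are just placeholders):

    import sys

    # Per-object cost of one short Python string (varies by Python build):
    print(sys.getsizeof("abcd"))          # typically tens of bytes on x86-64

    # Versus packing many 4-byte keys contiguously, ~4 bytes per key:
    keys = [b"abcd", b"efgh", b"ijkl"]    # placeholder keys
    packed = bytearray()
    for k in keys:
        packed += k                       # 4 bytes each, no per-object header
    print(len(packed) / len(keys))        # -> 4.0 bytes per key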
--
FWIW, Bob Jenkins's hash function is good, but 6..7X slower than it
needs to be. Paul Hsieh has a slightly faster one, and I have one
faster still:
                 core2 cycles      i7 cycles       (n = number of key bytes)
  Bob Jenkins    27 + 2.4*n        30.3 + 2.26*n
  Paul Hsieh     24 + 1.9*n        20.0 + 1.69*n
  cb2             9 + 0.41*n        8.4 + 0.32*n
Numbers above are from robust linear regressions over many randomized
trials and are very stable. { Mine is really just a Knott-style hash (as
attributed by Knuth circa 1970), modified for word-wise/8-byte loops over
byte-aligned data and with some constants selected for English-language
ASCII. } I've done a lot of distributional checking on dozens of
real-world key sets. The extra expense of the others does not generally
seem to make any difference in either average or maximum collision rates,
though I'm sure in pathological cases they can perform better.
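To make "word-wise/8-byte loops" concrete, here is a minimal pure-Python
sketch of that general shape - an 8-bytes-at-a-time multiplicative hash. The
multiplier is an arbitrary stand-in, not the tuned constants mentioned above,
and this is not the actual cb2 function:

    import struct

    MULT = 0x9E3779B97F4A7C15        # arbitrary odd 64-bit constant (illustrative)
    MASK = (1 << 64) - 1

    def hash8(key: bytes) -> int:
        h = len(key)
        n8 = len(key) // 8 * 8
        # Consume the key one 64-bit word (8 bytes) per loop iteration...
        for (w,) in struct.iter_unpack("<Q", key[:n8]):
            h = ((h ^ w) * MULT) & MASK
        # ...then fold in any trailing bytes.
        tail = key[n8:]
        if tail:
            h = ((h ^ int.from_bytes(tail, "little")) * MULT) & MASK
        return h ^ (h >> 32)         # final mix so the low bits see the high bits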
It sounds like Sanne wants integer keys, anyway. It may make
sense to copy Python's integer hash function (which is basically the
identity for small ints), though you should be careful about how that
interacts with Python's reduction from a hash code to a table slot:
CPython dicts mask off the low bits of the hash against a power-of-two
table size (with perturbation on collisions) rather than taking a modulus.
Any real distribution checking has likely only been done in an
end-to-end sense.
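A quick check of the relevant CPython behavior (details here are CPython's;
other implementations may differ):

    # Small ints hash to themselves; -1 is special because -1 signals an error
    # in the C-level hash API, so hash(-1) is remapped to -2.
    print(hash(12345))               # -> 12345
    print(hash(-1))                  # -> -2

    # The first probe slot in a CPython dict uses only the low bits:
    table_size = 8                   # dict table sizes are powers of two
    print(hash(12345) & (table_size - 1))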
--
With all these follow-up points to the interesting discussion so far made,
and looking at the kind of structure Sanne seems to actually be building,
is it possible that he should not be using giant hash tables at all?
I'm just guessing from his variable names...
Could he perhaps use 2D NumPy arrays for his two floats -- the count
and the probability -- and only use two separate small hash tables to convert
between integer or string keys and "dense index coordinates"?
Then he could use Numpy_count[s, t] where s = sindex[sub]; t = tindex[tgt],
and a parallel Numpy_prob[s, t]. Or even, perhaps, a 3D NumPy array,
Numpy_countProb[s, t, slot], where slot 0 could be the count and slot 1
the probability (or a recarray).
He may need to maintain the reverse mapping, sub = subList[ix], as
well, but that can be a simple Python list. When you see a new sub, you
set sindex[sub] = len(subList) and then append it to subList, and
analogously for targetList and tindex.
It may be necessary to pre-allocate the address space in the NumPy array
with some kind of bound on how many sindex and tindex values can occur,
but without such a bound, if things blow out RAM the algorithm will come to
a screeching halt anyway. So this kind of bounding may be intrinsic to
a problem this big on that machine.
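Pulling the last few paragraphs together, a minimal sketch of the arrangement
might look like the following. The bounds, key values, and the count/probability
layout follow the naming above but are otherwise placeholders:

    import numpy as np

    MAX_SUBS, MAX_TGTS = 1_000, 500        # assumed bounds; size these for real RAM

    # slot 0 = count, slot 1 = probability
    countProb = np.zeros((MAX_SUBS, MAX_TGTS, 2), dtype=np.float64)

    sindex = {}      # small dict: sub key -> dense row index
    tindex = {}      # small dict: tgt key -> dense column index
    subList = []     # reverse map: row index -> sub key
    tgtList = []     # reverse map: column index -> tgt key

    def dense(key, index, rev):
        """Translate a key to a dense coordinate, assigning a new one on first sight."""
        i = index.get(key)
        if i is None:
            i = len(rev)
            index[key] = i
            rev.append(key)
        return i

    def bump(sub, tgt, amount=1.0):
        s = dense(sub, sindex, subList)
        t = dense(tgt, tindex, tgtList)
        countProb[s, t, 0] += amount

    bump("the", "le")
    bump("the", "le")
    print(countProb[sindex["the"], tindex["le"], 0])   # -> 2.0

The probabilities in slot 1 could then presumably be filled in at the end
with vectorized NumPy operations over the counts.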
Whether this sort of thing might work may depend upon how "dynamic" the
container in question is during the course of his processing: e.g.,
whether he needs to dynamically 'del' columns or rows and rely on them
being actually "gone", or is just growing an indexing structure and
allocating new rows and new columns as he goes. It seems conceivable
that he may only be doing the latter, and then he would need no external
hash library mojo and, honestly, very little C-style Cython to save on
memory, since NumPy arrays are already quite dense.
It's been my experience that the ease with which Python (and Perl) expose
dicts leads a lot of people to use them as nested hash-table containers
rather than as "large address space-to-small address space" translators.
The translator approach is often much more efficient.
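For contrast with the translator sketch above, the nested-container style
would look roughly like this (names hypothetical), paying dict-entry and
boxed-float costs for every (sub, tgt) pair rather than once per distinct key:

    from collections import defaultdict

    nested_counts = defaultdict(dict)     # dict of dicts keyed on the raw keys

    def bump_nested(sub, tgt, amount=1.0):
        row = nested_counts[sub]
        row[tgt] = row.get(tgt, 0.0) + amount

    bump_nested("the", "le")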
Hopefully at least some of this has been useful commentary.