I figured I talk enough about Patricia that I would benefit from
experimenting with it to improve my understanding of it. So here's a
(somewhat simplified) version of the Patricia algorithm in Python.
Normal Patricia implementations work differently in the following
ways:
- they index suffixes of a larger string and therefore store pointers
into the string rather than having the string stuck there.
- they compress the storage of the bit indices (and support longer
strings) by storing only differences between successive bit indices
in the nodes.
- they reduce allocation overhead by storing the terminal nodes inside
other nodes in the tree.
- they only traverse the tree once to insert a new string, rather than
twice.
- less significantly, they usually count bits in big-endian order, making
alphabetically nearby strings likely to be close together in the tree.
Here are profiling results for inserting each line of
/usr/share/dict/words on my 366MHz PII:
>>> profile.run("for word in words[1:]: insert_new_string(wordtree, word)")
1592123 function calls in 139.860 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.950 1.950 139.850 139.850 <string>:1(?)
45403 27.850 0.001 53.360 0.001 pat.py:44(patricia_search)
1365106 59.710 0.000 59.710 0.000 pat.py:6(getbit)
45403 3.820 0.000 5.810 0.000 pat.py:61(insert_new_string_at)
45403 11.560 0.000 20.850 0.000 pat.py:71(first_bit_difference)
45403 25.960 0.001 48.880 0.001 pat.py:79(node_that_skipped_it)
45403 9.000 0.000 137.900 0.003 pat.py:87(insert_new_string)
1 0.010 0.010 139.860 139.860 profile:0(for word in words[1:]:
insert_new_string(wordtree, word))
0 0.000 0.000 profile:0(profiler)
The Python process was using 7.8MB more memory than immediately after
its startup after building this tree, or about 175 bytes per word; and
that it was able to insert 325 words per second, moderately down from
an earlier test in which it inserted 370 words per second for 1.89
seconds (700 words, up to a total of 1000.)
Note that this is roughly one million instructions per string
inserted, and this is a fairly small number of strings. This is why
the above enhancements to the algorithm are desirable and why Python
is not a desirable language in which to implement high-performance
algorithms. To look at it another way, these million instructions
amount to about 790 times the function-call overhead in Python on my
machine.
The work per string should increase as the log2 of the number of
strings; for the above two tests, that was about 9 and about 15, which
predicts a larger slowdown for the 40K-word case than I actually saw.
Here's a quick cheap-junk spell-check program that uses the Patricia
index of /usr/share/dict/words to find "possible misspellings" in the
above couple of paragraphs, and its rather amusing results:
>>> for word in string.split(paragraphs):
... word += "\n"
... if not lu(word) == word: print `word`, `lu(word)`
...
'Note\n' 'Nottingham\n'
'inserted,\n' 'inserted\n'
'a\n' 'ajar\n'
'strings.\n' 'strings\n'
'This\n' 'Thiensville\n'
'Python\n' 'Pythagoras\n'
'a\n' 'ajar\n'
'high-performance\n' 'higher\n'
'algorithms.\n' 'algorithms\n'
'To\n' 'Tobago\n'
'way,\n' 'way\n'
'790\n' 'wiping\n'
'function-call\n' 'functioned\n'
'Python\n' 'Pythagoras\n'
'machine.\n' 'machine\n'
'The\n' 'Thebes\n'
'log2\n' 'logjam\n'
'strings;\n' 'strings\n'
'tests,\n' 'tests\n'
'9\n' 'yonder\n'
'15,\n' 'queried\n'
'a\n' 'ajar\n'
'40K-word\n' 'though\n'
'I\n' 'Izvestia\n'
'saw.\n' 'sawfish\n'
(Yes, "a" and "I" are not in /usr/share/dict/words.)
#!/usr/bin/python
# patricia trees.
trace_getbit = 0
def getbit(astring, index):
"Get bit N from astring, little-endianly."
if trace_getbit: print "getting bit %d from %s" % (index, `astring`)
# When walking the tree looking for a key not yet in there,
# we may find ourselves walking among nodes whose keys are longer
# than our own. In an ordinary lookup situation, we could return
# "not found" at this point, but if we're going to find a string
# that matches "as closely as possible" so we can find the first
# mismatched bit, it's important to find *something*. So we just
# return zeroes.
try: relevantchar = astring[index / 8]
except IndexError: return 0
return not not (ord(relevantchar) & (1 << (index % 8)))
assert getbit("A", 0)
assert not getbit("A", 1)
assert not getbit("A", 2)
assert getbit("A", 6)
assert not getbit("A", 5)
assert getbit("a", 5)
assert getbit("a", 6)
assert not getbit(" a", 6)
assert getbit(" a", 14)
assert getbit(" A", 5)
assert not getbit(" A", 13)
# an internal Patricia treenode has a bit index and left and right
# children. An external Patricia treenode has a string. Here we
# represent internal nodes as [bitindex, left, right] lists and external
# nodes as lists containing just the string.
# Traditionally, each node contains only a count of bits by which to
# advance, not an absolute bit index. That makes the algorithm
# hairier, especially the update logic, but it makes it possible to
# handle larger strings.
sample_tree = [0, [1, ['hello'], ['boo']], [3, ['underwear'], ['mary']]]
def patricia_search(patnode, astring):
while 1:
if len(patnode) == 1: return patnode[0]
bitindex, left, right = patnode
if getbit(astring, bitindex): patnode = right
else: patnode = left
def lu(word): return patricia_search(sample_tree, word)
# So how do we construct such a tree from a list of strings?
# Well, for no strings, there is no way. You lose.
# For one string, it is just [string].
# For two strings, we start with [string1]. Then we do a search as above and
# discover that string2 is not in the tree, so we figure out that the two
# strings differ at bit N, so we need to insert a node into string1's search
# path at bit N, with its other alternative pointing at [string2].
def insert_new_string_at(patnode, newstring, bitindex):
copy = patnode[:]
newnode = [newstring]
if getbit(newstring, bitindex): patnode[:] = [bitindex, copy, newnode]
else: patnode[:] = [bitindex, newnode, copy]
# that gets called as
# insert_new_string_at(nodethatskippedit, newstring,
# bitindexofdifference)
def first_bit_difference(string1, string2):
for charindex in xrange(min(len(string1), len(string2))):
if string1[charindex] != string2[charindex]:
bitstart = charindex * 8
for bitindex in range(bitstart, bitstart + 8):
if getbit(string1, bitindex) != getbit(string2, bitindex):
return bitindex
def node_that_skipped_it(patnode, newstring, difference_index):
while 1:
if len(patnode) == 1: return patnode
bitindex, left, right = patnode
if bitindex > difference_index: return patnode
if getbit(newstring, bitindex): patnode = right
else: patnode = left
def insert_new_string(patnode, newstring):
bitindex = first_bit_difference(newstring,
patricia_search(patnode, newstring))
insert_new_string_at(node_that_skipped_it(patnode, newstring, bitindex),
newstring, bitindex)
--
<[EMAIL PROTECTED]> Kragen Sitaker <http://www.pobox.com/~kragen/>
To forget the evil in oneself is to turn one's own good -- now
untethered from modesty and rendered tyrannical -- into a magnified
power for evil. -- Steve Talbot, NETFUTURE #129, via SMART Letter