watered-down patricia in Python

Kragen Sitaker Mon, 08 Apr 2002 00:19:38 -0700

I figured I talk enough about Patricia that I would benefit from
experimenting with it to improve my understanding of it.  So here's a
(somewhat simplified) version of the Patricia algorithm in Python.
Normal Patricia implementations work differently in the following
ways:


- they index suffixes of a larger string and therefore store pointers
  into the string rather than having the string stuck there.
- they compress the storage of the bit indices (and support longer
  strings) by storing only differences between successive bit indices
  in the nodes.
- they reduce allocation overhead by storing the terminal nodes inside
  other nodes in the tree.
- they only traverse the tree once to insert a new string, rather than 
  twice.
- less significantly, they usually count bits in big-endian order, making
  alphabetically nearby strings likely to be close together in the tree.

Here are profiling results for inserting each line of
/usr/share/dict/words on my 366MHz PII:
>>> profile.run("for word in words[1:]: insert_new_string(wordtree, word)")
         1592123 function calls in 139.860 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.950    1.950  139.850  139.850 <string>:1(?)
    45403   27.850    0.001   53.360    0.001 pat.py:44(patricia_search)
  1365106   59.710    0.000   59.710    0.000 pat.py:6(getbit)
    45403    3.820    0.000    5.810    0.000 pat.py:61(insert_new_string_at)
    45403   11.560    0.000   20.850    0.000 pat.py:71(first_bit_difference)
    45403   25.960    0.001   48.880    0.001 pat.py:79(node_that_skipped_it)
    45403    9.000    0.000  137.900    0.003 pat.py:87(insert_new_string)
        1    0.010    0.010  139.860  139.860 profile:0(for word in words[1:]: 
insert_new_string(wordtree, word))
        0    0.000             0.000          profile:0(profiler)

The Python process was using 7.8MB more memory than immediately after
its startup after building this tree, or about 175 bytes per word; and
that it was able to insert 325 words per second, moderately down from
an earlier test in which it inserted 370 words per second for 1.89
seconds (700 words, up to a total of 1000.)

Note that this is roughly one million instructions per string
inserted, and this is a fairly small number of strings.  This is why
the above enhancements to the algorithm are desirable and why Python
is not a desirable language in which to implement high-performance
algorithms.  To look at it another way, these million instructions
amount to about 790 times the function-call overhead in Python on my
machine.

The work per string should increase as the log2 of the number of
strings; for the above two tests, that was about 9 and about 15, which
predicts a larger slowdown for the 40K-word case than I actually saw.

Here's a quick cheap-junk spell-check program that uses the Patricia
index of /usr/share/dict/words to find "possible misspellings" in the
above couple of paragraphs, and its rather amusing results:

>>> for word in string.split(paragraphs):
...     word += "\n"
...     if not lu(word) == word: print `word`, `lu(word)`
... 
'Note\n' 'Nottingham\n'
'inserted,\n' 'inserted\n'
'a\n' 'ajar\n'
'strings.\n' 'strings\n'
'This\n' 'Thiensville\n'
'Python\n' 'Pythagoras\n'
'a\n' 'ajar\n'
'high-performance\n' 'higher\n'
'algorithms.\n' 'algorithms\n'
'To\n' 'Tobago\n'
'way,\n' 'way\n'
'790\n' 'wiping\n'
'function-call\n' 'functioned\n'
'Python\n' 'Pythagoras\n'
'machine.\n' 'machine\n'
'The\n' 'Thebes\n'
'log2\n' 'logjam\n'
'strings;\n' 'strings\n'
'tests,\n' 'tests\n'
'9\n' 'yonder\n'
'15,\n' 'queried\n'
'a\n' 'ajar\n'
'40K-word\n' 'though\n'
'I\n' 'Izvestia\n'
'saw.\n' 'sawfish\n'

(Yes, "a" and "I" are not in /usr/share/dict/words.)


#!/usr/bin/python
# patricia trees.

trace_getbit = 0

def getbit(astring, index):
    "Get bit N from astring, little-endianly."
    if trace_getbit: print "getting bit %d from %s" % (index, `astring`)
    # When walking the tree looking for a key not yet in there,
    # we may find ourselves walking among nodes whose keys are longer
    # than our own.  In an ordinary lookup situation, we could return
    # "not found" at this point, but if we're going to find a string
    # that matches "as closely as possible" so we can find the first
    # mismatched bit, it's important to find *something*.  So we just
    # return zeroes.
    try: relevantchar = astring[index / 8]
    except IndexError: return 0
    return not not (ord(relevantchar) & (1 << (index % 8)))

assert getbit("A", 0)
assert not getbit("A", 1)
assert not getbit("A", 2)
assert getbit("A", 6)
assert not getbit("A", 5)
assert getbit("a", 5)
assert getbit("a", 6)
assert not getbit(" a", 6)
assert getbit(" a", 14)
assert getbit(" A", 5)
assert not getbit(" A", 13)

# an internal Patricia treenode has a bit index and left and right
# children.  An external Patricia treenode has a string.  Here we
# represent internal nodes as [bitindex, left, right] lists and external
# nodes as lists containing just the string.

# Traditionally, each node contains only a count of bits by which to
# advance, not an absolute bit index.  That makes the algorithm
# hairier, especially the update logic, but it makes it possible to
# handle larger strings.

sample_tree = [0, [1, ['hello'], ['boo']], [3, ['underwear'], ['mary']]]

def patricia_search(patnode, astring):
    while 1:
        if len(patnode) == 1: return patnode[0]
        bitindex, left, right = patnode
        if getbit(astring, bitindex): patnode = right
        else: patnode = left

def lu(word): return patricia_search(sample_tree, word)

# So how do we construct such a tree from a list of strings?
# Well, for no strings, there is no way.  You lose.
# For one string, it is just [string].
# For two strings, we start with [string1].  Then we do a search as above and
# discover that string2 is not in the tree, so we figure out that the two
# strings differ at bit N, so we need to insert a node into string1's search
# path at bit N, with its other alternative pointing at [string2].

def insert_new_string_at(patnode, newstring, bitindex):
    copy = patnode[:]
    newnode = [newstring]
    if getbit(newstring, bitindex): patnode[:] = [bitindex, copy, newnode]
    else: patnode[:] = [bitindex, newnode, copy]

# that gets called as
# insert_new_string_at(nodethatskippedit, newstring,
#                      bitindexofdifference)

def first_bit_difference(string1, string2):
    for charindex in xrange(min(len(string1), len(string2))):
        if string1[charindex] != string2[charindex]:
            bitstart = charindex * 8
            for bitindex in range(bitstart, bitstart + 8):
                if getbit(string1, bitindex) != getbit(string2, bitindex):
                    return bitindex

def node_that_skipped_it(patnode, newstring, difference_index):
    while 1:
        if len(patnode) == 1: return patnode
        bitindex, left, right = patnode
        if bitindex > difference_index: return patnode
        if getbit(newstring, bitindex): patnode = right
        else: patnode = left

def insert_new_string(patnode, newstring):
    bitindex = first_bit_difference(newstring,
                                    patricia_search(patnode, newstring))
    insert_new_string_at(node_that_skipped_it(patnode, newstring, bitindex),
                         newstring, bitindex)

-- 
<[EMAIL PROTECTED]>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
To forget the evil in oneself is to turn one's own good -- now
untethered from modesty and rendered tyrannical -- into a magnified
power for evil.  -- Steve Talbot, NETFUTURE #129, via SMART Letter

watered-down patricia in Python

Reply via email to