Here's some working Python code for processing "front-coded" lists of
strings. "Front-coding" is a compression method in which each string
is represented as an integer giving the number of characters at its
beginning that it shares with the beginning of the previous string in
the list, followed by the characters that differ.
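
For example, here's a minimal sketch of the idea in isolation (not part
of the program below; it stores the shared-prefix count directly, where
the locatedb format below stores the *change* in that count instead):

def front_code(strings):
    # Return a list of (shared, suffix) pairs, one per input string.
    coded = []
    prev = ''
    for s in strings:
        shared = 0
        while (shared < len(prev) and shared < len(s) and
               prev[shared] == s[shared]):
            shared = shared + 1
        coded.append((shared, s[shared:]))
        prev = s
    return coded

# front_code(['apple', 'applet', 'apply', 'banana'])
# => [(0, 'apple'), (5, 't'), (4, 'y'), (0, 'banana')]
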
This code can compress or decompress such a list, in the same format
used by GNU locate (which is useful if you want to decode
/var/lib/locate/locatedb) and can merge two such sorted lists into a
larger (uncompressed) sorted list.
Ordinarily, sorting lists of strings by comparison methods (as opposed
to radix sorting) gets slower as the strings have longer common
prefixes, because each lexicographic compare operation must start at
the beginning of the strings; the merge algorithm implemented here
doesn't need to do that, because it incrementally updates its
information about the length of the common prefix of the strings as it
goes. (It's about two orders of magnitude slower than /bin/sort on my
test dataset (30K lines of 1dfre10.txt), though, and that's just
doing a merge, not a full sort, which would take about 15 merge passes. But it
*is* in Python, and it does do lots of asserts, and it does produce
uncompressed output.)
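
Here's the heart of that trick pulled out as a standalone sketch (the
merge below does the same thing, but also has to shrink commonlen again
whenever one of the two strings is replaced by the next record from its
file):

def less_from(a, b, commonlen):
    # a and b are already known to agree on their first commonlen
    # characters, so the comparison can resume there instead of at 0.
    while (commonlen < len(a) and commonlen < len(b) and
           a[commonlen] == b[commonlen]):
        commonlen = commonlen + 1
    return a[commonlen:] < b[commonlen:], commonlen
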
It seems to me that this algorithm might be useful for constructing
large suffix arrays with fewer random disk seeks than the algorithms I
know about. However, I haven't finished reviewing the literature, and
these days, a corpus has to be pretty large to overflow RAM. (I think
SDRAM presently costs about a dollar a megabyte when you include all
the attendant non-RAM costs: cases, power supplies, CPUs,
motherboards, network cards, network ports, air conditioning, and
power lines.)
The suffix-array creation method is as follows.
We merge-sort a data file of records, each representing some suffix of
the source text. Each record contains a prefix length, some text
following the prefix, and additionally a pointer into the source text
where more text can be fetched if necessary. As sorting proceeds, the
text following the prefix tends to get shorter, because records get put
next to other records with longer and longer common prefixes. To
compensate for this, and to keep the text in the data files being
sorted relatively long while we are merging, we store the indices in
the source text from which more data needs to be fetched, sort those
indices, and then read in data from them in sorted order, to be used to
augment the data file on the next pass.
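
The code below doesn't implement that part, but here's a rough sketch
of what one such record and the refill step might look like (the names
and the minimum-text threshold here are invented, not taken from
working code):

class SuffixRecord:
    def __init__(self, prefixlen, text, source_pos):
        self.prefixlen = prefixlen    # chars shared with the previous record
        self.text = text              # text following that shared prefix
        self.source_pos = source_pos  # where this suffix starts in the source

    def known_length(self):
        # how many characters of this suffix are materialized so far
        return self.prefixlen + len(self.text)

def positions_to_refill(records, min_text=32):
    # Gather the source positions where the records that have run low on
    # text need more characters, and sort them, so the next pass can read
    # the source in order rather than with one random seek per record.
    positions = [r.source_pos + r.known_length()
                 for r in records if len(r.text) < min_text]
    positions.sort()
    return positions
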
In some cases, it is actually necessary to do a random seek into the
data file to get data to compare two prefixes. It's desirable to keep
these random seeks to a minimum, because they take a lot of time; on a
modern disk with 4ms rotational latency, 7ms seek time, and a 30MB/s
transfer rate, a seek ties up the disk for about 11ms, about as long as
transferring a third of a megabyte, so it's desirable to keep them down
to one random seek per megabyte of data to be sorted, or less.
#!/usr/bin/python
# encode, decode, or merge files compressed in the format of 'frcode'
# from the GNU locate package; this could be very useful, for example,
# when you have lost a bunch of files but don't know which ones, but
# you have an old locatedb from before the crash. You can use this
# program to decompress your old locatedb.
# TODO: compressed output from merge; translate into C; test for
# speed; extend to deal with sistrings.
# Note that this Python version is rather slow; it can only merge
# about 500 lines (of 1dfre10.txt) per second on my Pentium-366
# laptop.
import string, sys
def print_callback(astring):
print astring
sys.stdout.flush()
def read_locatedb_record(infile):
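    # One record is a prefix-length delta followed by a NUL-terminated
    # suffix; the delta is a single signed byte, or the escape byte 0x80
    # followed by a 16-bit big-endian signed value.  Returns None at EOF.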
lengthchar = infile.read(1)
if lengthchar == '':
return None
length = ord(lengthchar)
if length == 0x80: # magic value indicating 16-bit length
highbyte = infile.read(1)
if highbyte == '': raise "end of file in extended length"
lowbyte = infile.read(1)
if lowbyte == '': raise "end of file in extended length"
length = ord(highbyte) * 256 + ord(lowbyte)
if length >= 0x8000: length = length - 0x10000
elif length > 0x80:
length = length - 256
suffix = []
while 1:
suffixchar = infile.read(1)
if suffixchar == '':
raise "end of file in suffix", suffix
if suffixchar == '\0': break
suffix.append(suffixchar)
return length, string.join(suffix, '')
def read_locatedb(infile, callback=print_callback,
initial_curlen=0, initial_curstr=''):
curlen = initial_curlen
curstr = initial_curstr
while 1:
newrec = read_locatedb_record(infile)
if newrec is None: break
length, suffix = newrec
curlen = curlen + length
curstr = curstr[:curlen] + suffix
callback(curstr)
def merge_locatedbs(infile0, infile1, callback=print_callback):
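    # Merge two sorted locatedb-format streams, calling callback once per
    # decoded line, in sorted order.  commonlen tracks how many leading
    # characters the two current strings are known to share, so each
    # comparison can resume there instead of at character zero.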
curlens = [0, 0]
commonlen = 0
infiles = [infile0, infile1]
curstrs = ['', '']
newrec = read_locatedb_record(infile0)
if newrec is None:
return read_locatedb(infile1, callback)
else:
curlens[0], curstrs[0] = newrec
newrec = read_locatedb_record(infile1)
if newrec is None:
callback(curstrs[0])
return read_locatedb(infile0, callback,
initial_curstr=curstrs[0],
initial_curlen=curlens[0])
else:
curlens[1], curstrs[1] = newrec
while 1:
a, b = curstrs[0], curstrs[1]
while (len(a) > commonlen and
len(b) > commonlen and
a[commonlen] == b[commonlen]):
commonlen = commonlen + 1
assert commonlen <= len(curstrs[0])
assert commonlen <= len(curstrs[1])
assert curstrs[0][:commonlen] == curstrs[1][:commonlen]
if len(curstrs[0]) > commonlen and len(curstrs[1]) > commonlen:
assert curstrs[0][commonlen] != curstrs[1][commonlen]
if curstrs[0][commonlen:] < curstrs[1][commonlen:]:
which = 0
else:
which = 1
callback(curstrs[which])
newrec = read_locatedb_record(infiles[which])
if newrec is None: # we finished one file
callback(curstrs[not which])
return read_locatedb(infiles[not which], callback,
initial_curstr=curstrs[not which],
initial_curlen=curlens[not which])
length, suffix = newrec
curlens[which] = curlens[which] + length
curstrs[which] = curstrs[which][:curlens[which]] + suffix
a, b = curstrs[which], curstrs[not which]
# If we have three strings x, y, z, and we know x[:xn] == y[:xn]
# and z[:zn] == y[:zn], then we know
# x[:min(xn, zn)] == z[:min(xn, zn)].
# Here, x is the new curstrs[which], y is the old
# curstrs[which], z is curstrs[not which], xn is
# curlens[which], and zn is commonlen.
min_newcommonlen = min(commonlen, curlens[which])
assert (a[:min_newcommonlen] == b[:min_newcommonlen])
if commonlen != curlens[which]:
# This assertion says that when these two aren't equal,
# the inner loop at the top will run for zero iterations.
# Remember, we know not only that x[:xn] == y[:xn], but
# that curlens[which] is the largest xn for which this is
# true.
# WLOG assume xn < zn. Now, we know len(y) > xn, because
# zn <= len(y), so we know y[xn] exists. Now, either
# x[xn] != y[xn] or len(x) == xn. In either case,
# x[:xn+1] != y[:xn+1], but since xn < zn, xn+1 <= zn, so
# y[:xn+1] == z[:xn+1]. Therefore, x[:xn+1] != z[:xn+1].
# So xn is the largest value foo for which x[:foo] ==
# z[:foo].
# The above doesn't hold when xn == zn; in that case, we
# may have to examine more characters to find the length
# of the longest match.
assert (len(a) == min_newcommonlen or
len(b) == min_newcommonlen or
(a[min_newcommonlen] !=
b[min_newcommonlen]))
commonlen = min_newcommonlen
# a version with no asserts or comments; its length demonstrates that
# there is still too much code duplication.
##def merge_locatedbs(infile0, infile1, callback=print_callback):
## curlens = [0, 0]
## commonlen = 0
## infiles = [infile0, infile1]
## curstrs = ['', '']
## newrec = read_locatedb_record(infile0)
## if newrec is None:
## return read_locatedb(infile1, callback)
## else:
## curlens[0], curstrs[0] = newrec
## newrec = read_locatedb_record(infile1)
## if newrec is None:
## callback(curstrs[0])
## return read_locatedb(infile0, callback,
## initial_curstr=curstrs[0],
## initial_curlen=curlens[0])
## else:
## curlens[1], curstrs[1] = newrec
## while 1:
## a, b = curstrs[0], curstrs[1]
## while (len(a) > commonlen and
## len(b) > commonlen and
## a[commonlen] == b[commonlen]):
## commonlen = commonlen + 1
## if curstrs[0][commonlen:] < curstrs[1][commonlen:]:
## which = 0
## else:
## which = 1
## callback(curstrs[which])
## newrec = read_locatedb_record(infiles[which])
## if newrec is None:
## callback(curstrs[not which])
## return read_locatedb(infiles[not which], callback,
## initial_curstr=curstrs[not which],
## initial_curlen=curlens[not which])
## length, suffix = newrec
## curlens[which] = curlens[which] + length
## curstrs[which] = curstrs[which][:curlens[which]] + suffix
## commonlen = min(commonlen, curlens[which])
def encode(num):
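    # Encode a prefix-length delta: a single byte (two's complement) for
    # -127..127, otherwise the escape byte 0x80 followed by a 16-bit
    # big-endian two's-complement value.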
if -128 < num < 128:
return chr(num % 256)
else:
if num < 0: num = num + 65536
return '\x80' + chr(num / 256) + chr(num % 256)
def frcode(infile=sys.stdin, write=sys.stdout.write):
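    # Front-code the newline-terminated lines of infile in locatedb
    # format: for each line, write the change in shared-prefix length
    # relative to the previous line, then the part of the line after the
    # shared prefix, then a NUL byte.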
prevline = ''
prevprefix = 0
while 1:
line = infile.readline()
if line == '': break
while line[-1:] == '\n': line = line[:-1]
prefix = 0
while (len(prevline) > prefix and
len(line) > prefix and
prevline[prefix] == line[prefix]):
prefix = prefix + 1
diff = prefix - prevprefix
prevprefix = prefix
prevline = line
write(encode(diff))
write(line[prefix:])
write('\0')
if __name__ == '__main__':
if len(sys.argv) == 1 or sys.argv[1] == '-read': read_locatedb(sys.stdin)
elif sys.argv[1] == '-merge': merge_locatedbs(open(sys.argv[2]), open(sys.argv[3]))
elif sys.argv[1] == '-write': frcode()
else: raise "neither -read nor -write nor -merge specified"
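# Example usage (assuming this file is saved as frcode.py; the filenames
# are hypothetical):
#   python frcode.py -write < sorted-names.txt > names.db
#   python frcode.py -read < names.db
#   python frcode.py -merge names1.db names2.db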
# fixed bugs:
# the inner loop at the top of merge_locatedbs, which compares
# character by character, was originally at the bottom; the effect was
# that the first two strings were assumed to have no characters in
# common.
# But that assumption happened to be true, because originally the
# crufty code at the top that reads in an initial record from each file
# wasn't there; the cost was an extra couple of blank lines at the
# beginning of the output.
# At some point, the conditional assert that says the strings must
# differ at min_newcommonlen if they had different numbers of chars in
# common with their common parent was using the previous value of
# curlens[which] rather than its current value. This was wrong.
# read_locatedb didn't always take initial_curlen and initial_curstr
# arguments, and so when merge_locatedbs passed control off to it, it
# would spit out decompressed lines that were missing their
# beginnings.
--
<[EMAIL PROTECTED]> Kragen Sitaker <http://www.pobox.com/~kragen/>
Edsger Wybe Dijkstra died in August of 2002. This is a terrible loss to the
whole world. See http://advogato.org/person/raph/diary.html?start=252 and
http://www.kode-fu.com/geek/2002_08_04_archive.shtml