Here's some working Python code for processing "front-coded" lists of
strings. "Front-coding" is a compression method in which each string
is represented as an integer giving the number of characters at its
beginning that it shares with the beginning of the previous string in
the list, followed by the characters that differ.
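
For example, here's a minimal sketch of the idea in isolation (not part
of the program below; it stores the shared-prefix count directly, where
the locatedb format below stores the *change* in that count instead):

def front_code(strings):
    # Return a list of (shared, suffix) pairs, one per input string.
    coded = []
    prev = ''
    for s in strings:
        shared = 0
        while (shared < len(prev) and shared < len(s) and
               prev[shared] == s[shared]):
            shared = shared + 1
        coded.append((shared, s[shared:]))
        prev = s
    return coded

# front_code(['apple', 'applet', 'apply', 'banana'])
# => [(0, 'apple'), (5, 't'), (4, 'y'), (0, 'banana')]
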
This code can compress or decompress such a list, in the same format
used by GNU locate (which is useful if you want to decode
/var/lib/locate/locatedb) and can merge two such sorted lists into a
larger (uncompressed) sorted list.
Ordinarily, sorting lists of strings by comparison methods (as opposed
to radix sorting) gets slower as the strings have longer common
prefixes, because each lexicographic compare operation must start at
the beginning of the strings; the merge algorithm implemented here
doesn't need to do that, because it incrementally updates its
information about the length of the common prefix of the strings as it
goes. (It's about two orders of magnitude slower than /bin/sort on my
test dataset (30K lines of 1dfre10.txt), though, and that's just
doing a merge, not a full sort, which would take about 15 merge passes. But it
*is* in Python, and it does do lots of asserts, and it does produce
uncompressed output.)
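
Here's the heart of that trick pulled out as a standalone sketch (the
merge below does the same thing, but also has to shrink commonlen again
whenever one of the two strings is replaced by the next record from its
file):

def less_from(a, b, commonlen):
    # a and b are already known to agree on their first commonlen
    # characters, so the comparison can resume there instead of at 0.
    while (commonlen < len(a) and commonlen < len(b) and
           a[commonlen] == b[commonlen]):
        commonlen = commonlen + 1
    return a[commonlen:] < b[commonlen:], commonlen
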
It seems to me that this algorithm might be useful for constructing
large suffix arrays with fewer random disk seeks than the algorithms I
know about. However, I haven't finished reviewing the literature, and
these days, a corpus has to be pretty large to overflow RAM. (I think
SDRAM presently costs about a dollar a megabyte when you include all
the attendant non-RAM costs: cases, power supplies, CPUs,
motherboards, network cards, network ports, air conditioning, and
power lines.)
The suffix-array creation method is as follows.
We merge-sort a data file of records, each representing some suffix of
the source text. Each record contains a prefix length, some text
following the prefix, and additionally a pointer into the source text
where more text can be fetched if necessary. As sorting proceeds, the
text following the prefix tends to get shorter, because records get put
next to other records with longer and longer common prefixes. To
compensate for this, and to keep the text in the data files being
sorted relatively long while we are merging, we store the indices in
the source text from which more data needs to be fetched, sort those
indices, and then read in data from them in sorted order, to be used to
augment the data file on the next pass.
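
The code below doesn't implement that part, but here's a rough sketch
of what one such record and the refill step might look like (the names
and the minimum-text threshold here are invented, not taken from
working code):

class SuffixRecord:
    def __init__(self, prefixlen, text, source_pos):
        self.prefixlen = prefixlen    # chars shared with the previous record
        self.text = text              # text following that shared prefix
        self.source_pos = source_pos  # where this suffix starts in the source

    def known_length(self):
        # how many characters of this suffix are materialized so far
        return self.prefixlen + len(self.text)

def positions_to_refill(records, min_text=32):
    # Gather the source positions where the records that have run low on
    # text need more characters, and sort them, so the next pass can read
    # the source in order rather than with one random seek per record.
    positions = [r.source_pos + r.known_length()
                 for r in records if len(r.text) < min_text]
    positions.sort()
    return positions
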
In some cases, it is actually necessary to do a random seek into the
data file to get data to compare two prefixes. It's desirable to keep
these random seeks to a minimum, because they take a lot of time; on a
modern disk with 4ms rotational latency, 7ms seek time, and a 30MB/s
transfer rate, a seek ties up the disk for about 11ms, about as long as
transferring a third of a megabyte, so it's desirable to keep them down
to one random seek per megabyte of data to be sorted, or less.
#!/usr/bin/python
# encode, decode, or merge files compressed in the format of 'frcode'
# from the GNU locate package; this could be very useful, for example,
# when you have lost a bunch of files but don't know which ones, but
# you have an old locatedb from before the crash. You can use this
# program to decompress your old locatedb.
# TODO: compressed output from merge; translate into C; test for
# speed; extend to deal with sistrings.
# Note that this Python version is rather slow; it can only merge
# about 500 lines (of 1dfre10.txt) per second on my Pentium-366
# laptop.
import string, sys
def print_callback(astring):
print astring
sys.stdout.flush()
def read_locatedb_record(infile):
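    # One record is a prefix-length delta followed by a NUL-terminated
    # suffix; the delta is a single signed byte, or the escape byte 0x80
    # followed by a 16-bit big-endian signed value.  Returns None at EOF.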
lengthchar = infile.read(1)
if lengthchar == '':
return None
length = ord(lengthchar)
if length == 0x80: # magic value indicating 16-bit length
highbyte = infile.read(1)
if highbyte == '': raise "end of file in extended length"
lowbyte = infile.read(1)
if lowbyte == '': raise "end of file in extended length"
length = ord(highbyte) * 256 + ord(lowbyte)
if length >= 0x8000: length = length - 0x10000
elif length > 0x80:
length = length - 256
suffix = []
while 1:
suffixchar = infile.read(1)
if suffixchar == '':
raise "end of file in suffix", suffix
if suffixchar == '\0': break
suffix.append(suffixchar)
return length, string.join(suffix, '')
def read_locatedb(infile, callback=print_callback,
initial_curlen=0, initial_curstr=''):
curlen = initial_curlen
curstr = initial_curstr
while 1:
newrec = read_locatedb_record(infile)
if newrec is None: break
length, suffix = newrec
curlen = curlen + length
curstr = curstr[:curlen] + suffix
callback(curstr)
def merge_locatedbs(infile0, infile1, callback=print_callback):
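    # Merge two sorted locatedb-format streams, calling callback once per
    # decoded line, in sorted order.  commonlen tracks how many leading
    # characters the two current strings are known to share, so each
    # comparison can resume there instead of at character zero.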
curlens = [0, 0]
commonlen = 0
infiles = [infile0, infile1]
curstrs = ['', '']
newrec = read_locatedb_record(infile0)
if newrec is None:
return read_locatedb(infile1, callback)
else:
curlens[0], curstrs[0] = newrec
newrec = read_locatedb_record(infile1)
if newrec is None:
callback(curstrs[0])
return read_locatedb(infile0, callback,
initial_curstr=curstrs[0],
initial_curlen=curlens[0])
else:
curlens[1], curstrs[1] = newrec
while 1:
a, b = curstrs[0], curstrs[1]
while (len(a) > commonlen and
len(b) > commonlen and
a[commonlen] == b[commonlen]):
commonlen = commonlen + 1
assert commonlen <= len(curstrs[0])
assert commonlen <= len(curstrs[1])
assert curstrs[0][:commonlen] == curstrs[1][:commonlen]
if len(curstrs[0]) > commonlen and len(curstrs[1]) > commonlen:
assert curstrs[0][commonlen] != curstrs[1][commonlen]
if curstrs[0][commonlen:] < curstrs[1][commonlen:]:
which = 0
else:
which = 1
callback(curstrs[which])
newrec = read_locatedb_record(infiles[which])
if newrec is None: # we finished one file
callback(curstrs[not which])
return read_locatedb(infiles[not which], callback,
initial_curstr=curstrs[not which],
initial_curlen=curlens[not which])
length, suffix = newrec
curlens[which] = curlens[which] + length
curstrs[which] = curstrs[which][:curlens[which]] + suffix
a, b = curstrs[which], curstrs[not which]
# If we have three strings x, y, z, and we know x[:xn] == y[:xn]
# and z[:zn] == y[:zn], then we know
# x[:min(xn, zn)] == z[:min(xn, zn)].
# Here, x is the new curstrs[which], y is the old
# curstrs[which], z is curstrs[not which], xn is
# curlens[which], and zn is commonlen.
min_newcommonlen = min(commonlen, curlens[which])
assert (a[:min_newcommonlen] == b[:min_newcommonlen])
if commonlen != curlens[which]:
# This assertion says that when these two aren't equal,
# the inner loop at the top will run for zero iterations.
# Remember, we know not only that x[:xn] == y[:xn], but
# that curlens[which] is the largest xn for which this is
# true.
# WLOG assume xn < zn. Now, we know len(y) > xn, because
# zn <= len(y), so we know y[xn] exists. Now, either
# x[xn] != y[xn] or len(x) == xn. In either case,
# x[:xn+1] != y[:xn+1], but since xn < zn, xn+1 <= zn, so
# y[:xn+1] == z[:xn+1]. Therefore, x[:xn+1] != z[:xn+1].
# So xn is the largest value foo for which x[:foo] ==
# z[:foo].
# The above doesn't hold when xn == zn; in that case, we
# may have to examine more characters to find the length
# of the longest match.
assert (len(a) == min_newcommonlen or
len(b) == min_newcommonlen or
(a[min_newcommonlen] !=
b[min_newcommonlen]))
commonlen = min_newcommonlen
# a version with no asserts or comments; its length demonstrates that
# there is still too much code duplication.
##def merge_locatedbs(infile0, infile1, callback=print_callback):
## curlens = [0, 0]
## commonlen = 0
## infiles = [infile0, infile1]
## curstrs = ['', '']
## newrec = read_locatedb_record(infile0)
## if newrec is None:
## return read_locatedb(infile1, callback)
## else:
## curlens[0], curstrs[0] = newrec
## newrec = read_locatedb_record(infile1)
## if newrec is None:
## callback(curstrs[0])
## return read_locatedb(infile0, callback,
## initial_curstr=curstrs[0],
## initial_curlen=curlens[0])
## else:
## curlens[1], curstrs[1] = newrec
## while 1:
## a, b = curstrs[0], curstrs[1]
## while (len(a) > commonlen and
## len(b) > commonlen and
## a[commonlen] == b[commonlen]):
## commonlen = commonlen + 1
## if curstrs[0][commonlen:] < curstrs[1][commonlen:]:
## which = 0
## else:
## which = 1
## callback(curstrs[which])
## newrec = read_locatedb_record(infiles[which])
## if newrec is None:
## callback(curstrs[not which])
## return read_locatedb(infiles[not which], callback,
## initial_curstr=curstrs[not which],
## initial_curlen=curlens[not which])
## length, suffix = newrec
## curlens[which] = curlens[which] + length
## curstrs[which] = curstrs[which][:curlens[which]] + suffix
## commonlen = min(commonlen, curlens[which])
def encode(num):
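    # Encode a prefix-length delta: a single byte (two's complement) for
    # -127..127, otherwise the escape byte 0x80 followed by a 16-bit
    # big-endian two's-complement value.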
if -128 < num < 128:
return chr(num % 256)
else:
if num < 0: num = num + 65536
return '\x80' + chr(num / 256) + chr(num % 256)
def frcode(infile=sys.stdin, write=sys.stdout.write):
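    # Front-code the newline-terminated lines of infile in locatedb
    # format: for each line, write the change in shared-prefix length
    # relative to the previous line, then the part of the line after the
    # shared prefix, then a NUL byte.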
prevline = ''
prevprefix = 0
while 1:
line = infile.readline()
if line == '': break
while line[-1:] == '\n': line = line[:-1]
prefix = 0
while (len(prevline) > prefix and
len(line) > prefix and
prevline[prefix] == line[prefix]):
prefix = prefix + 1
diff = prefix - prevprefix
prevprefix = prefix
prevline = line
write(encode(diff))
write(line[prefix:])
write('\0')
if __name__ == '__main__':
if len(sys.argv) == 1 or sys.argv[1] == '-read': read_locatedb(sys.stdin)
elif sys.argv[1] == '-merge': merge_locatedbs(open(sys.argv[2]), open(sys.argv[3]))
elif sys.argv[1] == '-write': frcode()
else: raise "neither -read nor -write nor -merge specified"
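# Example usage (assuming this file is saved as frcode.py; the filenames
# are hypothetical):
#   python frcode.py -write < sorted-names.txt > names.db
#   python frcode.py -read < names.db
#   python frcode.py -merge names1.db names2.db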
# fixed bugs:
# the inner loop at the top of merge_locatedbs, which compares
# character by character, was originally at the bottom; the effect was
# that the first two strings were assumed to have no characters in
# common.
# But that assumption happened to be true, because originally the
# crufty code at the top that reads in an initial record from each file
# wasn't there; the cost was an extra couple of blank lines at the
# beginning of the output.
# At some point, the conditional assert that says the strings must
# differ at min_newcommonlen if they had different numbers of chars in
# common with their common parent was using the previous value of
# curlens[which] rather than its current value. This was wrong.
# read_locatedb didn't always take initial_curlen and initial_curstr
# arguments, and so when merge_locatedbs passed control off to it, it
# would spit out decompressed lines that were missing their
# beginnings.
--
<[EMAIL PROTECTED]> Kragen Sitaker <http://www.pobox.com/~kragen/>
Edsger Wybe Dijkstra died in August of 2002. This is a terrible loss to the
whole world. See http://advogato.org/person/raph/diary.html?start=252 and
http://www.kode-fu.com/geek/2002_08_04_archive.shtml