I've tried this on my most recent 200MB of email, and it seems to work
OK.  Future directions include:
- result ranking: most recent is not always best!
- handling bigger corpuses
- word-prefix search (krag*; sketched just after this list)
- field search (krag* in To)
- incremental search and display
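
Since the index stores all its terms in one sorted text file (the lindb
class in the script below), word-prefix search could be a plain range
scan over that file.  Here's a rough sketch, not part of the program:
prefix_postings is a hypothetical helper that assumes an 'idx' built by
the index class below, and it skips the headwords shortcut for brevity.

def prefix_postings(idx, prefix):
    "Union of the posting lists of every term starting with 'prefix'."
    f = idx.db.open('contents', 'r')
    matches = {}
    while 1:
        name, value = idx.db.readline(f)
        if name is None: break             # EOF
        if name.startswith(prefix):
            for pos in value.split(): matches[int(pos)] = 1
        elif name > prefix: break          # sorted file: no more matches
    return sorted(matches.keys())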

Indexing 200MB of email, amounting to 30 000 messages containing a
little over a million searchable terms, took about 12 minutes.
Unfortunately, the program has little locality of reference when
building the index, so its performance drops by 1.5-3 orders of
magnitude when it starts to hit swap.

Thanks to:
- Doug Cutting, of Lucene, for insights on data storage
- Tim Bray for his recent series of articles on building search engines
- DMH, who may wish to remain pseudonymous, for inciting me to actually 
  do something by asserting that it wasn't hard
- Ka-Ping Yee for helping to provide the equipment
- Beatrice for tolerating me indexing my email instead of cleaning the house

This program obviously works only if you have Unix mbox files, and it
doesn't work properly on the SysV variety with content-length fields.

#!/usr/bin/env python
"""Index my mailbox.

This utility generates a term index of an mbox file and allows you to
search it.  If you run it with just the mbox file as an argument, it
assumes you want to reindex it; if you run it with the mbox file and
some other terms, it does a search.  Like Google, it's default-and,
and you can precede a term with a - to exclude messages containing it.

Requires recent Python, probably 2.3.  Won't work in 2.1.3.

I have only 4.5M postings out of 7K messages in 50MB, so I can make a
list of all postings in RAM.  It takes about a minute or two and
another 50MB of RAM.  Writing it to disk with Berkeley DB used to take
another 10 minutes or so!  The index files are about 27MB (down from
45MB with Berkeley DB), but gzip to <11MB.

Unimplemented feature 1: merge multiple index files into a single one,
so you can index files whose index won't fit in RAM.  (The index seems
to be about the same size as the file itself, so this probably means
indexing files bigger than RAM.)  If you had a really big corpus, you
might need to do this more than once, but who has a corpus that big?

Unimplemented feature 2: handle mbox files bigger than the address
space.

"""
# - Python space costs:
#   Looks like a dict is about 154 bytes, an empty list about 48,
#   and a one-item list about 96 bytes.  There are about 340K terms
#   if we use the naive [a-zA-Z0-9]+ term definition, so about 340K *
#   96 + 4M * 4 bytes = ~48MB.

# "Well, I've read about inverted indices and stuff, but I've never
# built a full-text search engine." - me on Thursday
# "Dude, it's, like, cake." - DMH

import mmap, re, sys, os

def sorted(alist):
    "Returns a list after sorting it."
    alist.sort()
    return alist

class lindb:
    """Linear database.

    I was storing data in a Berkeley DB file, but it was really slow ---
    it took 10 minutes to write out 45 megabytes of data with 340K keys.
    So I thought I'd take the Lucene approach; just write a sorted list
    of key-value pairs, write a "headwords" file to make it easy to find
    the right part of the file, then linearly search.

    This module lets it take much less time, by just sorting the data
    and storing it in a text file.  It's very slightly slower than the
    Berkeley DB version, but uses about half the space (for my data.)

    The big compromises are that this module doesn't allow the same
    flexibility in access as Berkeley DB, and it doesn't allow the storage
    of arbitrary data --- keys containing ': ' or data containing
    newlines can screw things up.

    The basic operations:
    - db = lindb(filename) --- open database in 'filename', creating if nec.
    - db.set(somedict) --- set contents of database to contents of 'somedict'
    - db[key] --- return value for 'key' or raise KeyError
    - db.get(key, default) --- return value for 'key' or default

    """

    ### basic primitives

    # I tried various powers of 2 here: 131072, 32768, 8192, 1024, and
    # 4096; the smaller it was, the better the performance was, but the
    # difference between 4096 and 1024 wasn't significant.

    threshold_size = 4096

    def open(self, filename, mode):
        return file("%s/%s" % (self.dirname, filename), mode)

    def readline(self, f):
        """Read the next key-value pair from open file f.

        Returns (key, value) on success, or (None, None) on EOF.

        """
        line = f.readline()
        if not line: return None, None
        # XXX no robustness against incomplete writes
        # split on the first ': ' only, so values containing ': ' survive
        return line[:-1].split(': ', 1)

    ### internal routines

    def headwords(self, filename):
        """Generate a headwords file for 'filename'.

        The headwords file lists a few words from the file, along with
        the positions of their entries in the file.  This allows the
        lookup process to find a particular entry relatively quickly,
        while retaining high locality of access.

        """
        f = self.open(filename, 'r')
        hwdict = {}
        blockno = None
        while 1:
            pos = f.tell()
            name, value = self.readline(f)
            if name is None: break
            nblockno = pos // self.threshold_size
            if nblockno != blockno:
                hwdict[name] = pos
                blockno = nblockno
        self.writefile(filename + '.hw', hwdict)

    def writefile(self, filename, contents):
        """Write file named 'filename' with contents of dict 'contents'.

        If necessary, this writes a headwords file for 'filename'.

        """
        # obviously this technique won't work at all for stuff that doesn't
        # fit in RAM
        f = self.open(filename, 'w')
        for key in sorted(contents.keys()):
            f.write("%s: %s\n" % (key, contents[key]))
        size = f.tell()
        f.close()
        if size > self.threshold_size: self.headwords(filename)
        else:
            try: os.unlink('%s/%s.hw' % (self.dirname, filename))
            except OSError: pass

    def lookup(self, filename, term):
        """Return greatest (key, value) pair not greater than 'term'.

        This returns the key-value pair where the key is term if there
        is one, otherwise the one before where 'term' would be.

        Uses headwords files if they exist to speed up access.
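
        For example, looking up 'lunch' in 'contents' first recurses
        into 'contents.hw' (if it exists) to find the byte position of
        the block whose headword precedes 'lunch', seeks there, and
        scans forward line by line.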

        """
        start = 0
        if os.path.exists('%s/%s.hw' % (self.dirname, filename)):
            name, value = self.lookup(filename + '.hw', term)
            if name is not None:
                assert type(name) is type(term)
                assert name <= term
                start = int(value)

        f = self.open(filename, 'r')
        f.seek(start, 0)

        name, value = None, None
        while 1:
            nname, nvalue = self.readline(f)
            if nname is None or nname > term: return name, value
            name, value = nname, nvalue

    # XXX maybe mmap?

    ### external interfaces

    def __init__(self, dirname):
        self.dirname = dirname
        if not os.path.exists(dirname): os.mkdir(dirname)
    def __getitem__(self, name):
        lname, lvalue = self.lookup('contents', name)
        if lname == name: return lvalue
        raise KeyError, name
    def get(self, name, value=None):
        try: return self[name]
        except KeyError: return value
    def set(self, contents):
        self.writefile('contents', contents)

def ok(a, b):
    "Regression test function."
    assert a == b, (a, b)

def test_lindb():
    "Very basic regression test for linear db class lindb."
    os.system('rm -rf tmp.db')
    x = lindb("tmp.db")
    x.set({'roar': 'lion'})
    ok(x['roar'], 'lion')
    ok(x.get('roar', 'boo'), 'lion')
    ok(x.get('foar', 'boo'), 'boo')
    os.system('rm -rf tmp.db')

test_lindb()
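
# A sketch of unimplemented feature 1: a streaming merge of two sorted
# 'contents' files, so that neither index need fit in RAM.  This
# merge_contents function is a hypothetical helper, not part of the
# original program; it assumes the two indexes cover disjoint chunks of
# the same mbox, so their byte-offset postings don't collide.
def merge_contents(db_a, db_b, db_out):
    "Merge the sorted contents files of lindbs db_a and db_b into db_out."
    fa, fb = db_a.open('contents', 'r'), db_b.open('contents', 'r')
    out = db_out.open('contents', 'w')
    ka, va = db_a.readline(fa)
    kb, vb = db_b.readline(fb)
    while ka is not None or kb is not None:
        if kb is None or (ka is not None and ka < kb):
            out.write("%s: %s\n" % (ka, va))
            ka, va = db_a.readline(fa)
        elif ka is None or kb < ka:
            out.write("%s: %s\n" % (kb, vb))
            kb, vb = db_b.readline(fb)
        else:  # the same term in both: concatenate the posting lists
            out.write("%s: %s %s\n" % (ka, va, vb))
            ka, va = db_a.readline(fa)
            kb, vb = db_b.readline(fb)
    size = out.tell()
    out.close()
    # regenerate the headwords file, as writefile would
    if size > db_out.threshold_size: db_out.headwords('contents')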

def mkindex(mm):
    """Create an in-memory index of an mbox file.

    This function takes a string-like object, such as an mmap object,
    and returns a list of byte offsets in it where mail messages
    start, and a postings dictionary that maps words to lists of
    message starting byte offsets.

    """
    poses = [0]
    wordpat = re.compile("[a-zA-Z0-9]+")
    fi = wordpat.finditer(mm)
    allpostings = {}
    while 1:
        # This won't match the beginning of the first message.
        nps = mm.find("\nFrom ", poses[-1])
        if nps == -1: nps = len(mm)
        nps += 1
        this_guys_postings = {}
        for mo in fi:
            if mo.start() >= nps:
                # I wish I could push back the item we just got, because
                # it belongs to the next message and won't get indexed,
                # but it's the "From" at the beginning of the message.
                # So it doesn't really matter.
                break
            this_guys_postings[mo.group(0).lower()] = 1
        for word in this_guys_postings.iterkeys():
            if not allpostings.has_key(word): allpostings[word] = []
            allpostings[word].append(poses[-1])
        if nps > len(mm): break
        poses.append(nps)
        if len(poses) % 250 == 0: print "%d msgs" % len(poses)
    return poses, allpostings

class index:
    """Index of an mbox.

    Stores a lindb under mboxname.idx which contains posting lists for
    the mbox's messages, and allows you to search it.

    """
    def postings(self, term):
        "Returns the posting list for a term."
        return [int(id) for id in self.db.get(term, '').split()]

    def msg(self, pos):
        """Returns the message at a particular byte offset in the mbox.

        The items in posting lists are just such byte offsets.

        """
        npos = self.mm.find("\nFrom ", pos + 1)
        if npos == -1: npos = self.size
        rv = self.mm[pos:npos]
        assert "\nFrom " not in rv
        return rv

    def __init__(self, fname):
        "init.  fname is the path to the mbox."
        self.f = file(fname)
        self.f.seek(0, 2) # EOF
        self.size = self.f.tell()
        self.mm = mmap.mmap(self.f.fileno(), self.size,
                            access=mmap.ACCESS_READ)
        self.db = lindb(fname + '.idx')

    def write(self):
        "Reindex an mbox."
        poses, allpostings = mkindex(self.mm)
        print "indexed;"
        for term in allpostings.keys():
            allpostings[term] = ' '.join([str(x) for x in allpostings[term]])
        self.db.set(allpostings)
        print "done."

    def search(self, terms, exclusions=[]):
        """Returns a posting list for some search.

        'terms' is a list of terms to require the presence of;
        'exclusions' is an optional list of terms to require the
        absence of.
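
        For example, search(['lunch', 'patch'], ['spam']) returns the
        sorted byte offsets of the messages that contain both 'lunch'
        and 'patch' but not 'spam'.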

        """
        lists = [self.postings(term) for term in terms]
        if not lists: return []  # no positive terms to intersect
        excludelists = [self.postings(term) for term in exclusions]

        # intersect lists.
        # sort by length.  ii is in the tuples to prevent comparing the lists
        # themselves.
        sorted_lists = sorted([(len(lists[ii]), ii, lists[ii])
                               for ii in range(len(lists))])
        # start with smallest 
        rvdict = dict([(key, 1) for key in sorted_lists[0][2]])

        # it might be better to, I don't know, do this last?
        for excludelist in excludelists:
            for message in excludelist:
                if rvdict.has_key(message): del rvdict[message]

        # now remove items not in all the other lists
        for _, _, otherlist in sorted_lists[1:]:
            newdict = {}
            for message in otherlist:
                if rvdict.has_key(message): newdict[message] = 1
            rvdict = newdict

        return sorted(rvdict.keys())
    
    def cmdlinesearch(self, terms):
        """Parses a search, runs it, and returns a posting list.

        You have to break it up into terms first, as the Unix command
        line does for you, but then this parses it to see what has a
        '-' before it and what doesn't.

        """
        plusterms = []
        minusterms = []
        for term in terms:
            if term.startswith('-'): minusterms.append(term[1:].lower())
            else: plusterms.append(term.lower())
        return self.search(plusterms, minusterms)
            
    
def main(argv, stdout):
    """Command-line search/reindex interface."""
    if len(argv) > 2:
        idx = index(argv[1])
        for pos in idx.cmdlinesearch(argv[2:]):
            stdout.write(idx.msg(pos))
    elif len(argv) == 2:
        index(argv[1]).write()
    else:
        print ("Usage:\n\t%s mailboxfilename -- to reindex\n"
          "\t%s mailboxfilename term1 term2 -unwanted1 -unwanted2 -- to look"
               % (argv[0], argv[0]))

if __name__ == '__main__': main(sys.argv, sys.stdout)
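
For example, assuming you save the script as mailindex.py:

    python mailindex.py ~/mbox               # build or rebuild ~/mbox.idx
    python mailindex.py ~/mbox lunch -spam   # messages with "lunch" but not "spam"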
