Re: full-text search of email

Dave Long Sun, 07 Mar 2004 01:01:09 -0800


> Unfortunately, the program has little locality of reference when
> building the index, so its performance drops by 1.5-3 orders of
> magnitude when it starts to hit swap.


I've tried a tape-like approach to
this problem.  Much slower than the
big dict approach, but:
- it's done in sh+utilities, and
- it only seems to suffer by about
  a factor of two once it starts
  pegging the disk light.

It expects an mbox-formatted file,
named 'msgbox', and will create two
indices: 
- msgbox.wdx, containing the search terms
- msgbox.mdx, indexing message boundaries

The first step in indexing is to stream
through the mbox, generating a tape of:

| string offset |

pairs, where 'string' is a search term,
and 'offset' is the byte offset of the
start of the message where it was found,
ascending by offset.

Given that tape, the search term index
follows the same format, but re-sorted
to contain unique entries, ascending by
the string field.

The message boundary index is formed by
using the breaks between ascending pairs
of message offsets to generate an index
of the form:

| offset length |

for each individual mail message, sorted
by normal sort order, not numerically.

Why is everything kept in ascending string
order?  So we can use 'look' and 'join':
looking up a series of terms involves a
'look' in the word index for each term,
and the appropriate (positive or negative)
'join' to come up with a results list of
entries of the form:

| offset |

from which we can do a 'look' against the
message index to form a list of entries:

| offset length |

which enables us to copy the relevant byte
ranges out of the original mbox.

Just about all of this can be done in very
little memory: both the index analysis and
the joins for retrieval involve streaming
the data past the CPU, much like a tape;
the lookups are random access, but have a
good locality both for the roots of their
binary searches and for the generated list
of results.  The sorts should be the only
components that need to have more than a
record or two present at a time.  Does a
stock sort(1) tune itself for the amount 
of real memory available?

-Dave

:: :: ::

#!/bin/bash

M=${M:-msgbox}

[ -f $M ] || { echo "$0: can't find '$M'" 1>&2; exit 1 }

if [ ! -f $M.wdx -o $M.wdx -ot $M ]; then
        echo "...reindexing..." 1>&2
        awk '$0 ~ /^From /              { gsub(/From /,"\001\001\001\001 ") }
                                        { print }' $M |\
        tr "[A-Z] \t" "[a-z]\n" |\
        awk --posix 'BEGIN              { msg = off = 0 }
        $0 == "\001\001\001\001"        { msg = off; off += 5; next }
        length($0)>2 && length($0)<20 && $0~/^[a-z]/    { print $0 " " msg }
                                        { off += length($0)+1 }
        END                             { print "z\001 " off }' | \
        tr -cd "[a-z0-9 \n]" > /tmp/$$

        sort /tmp/$$ | uniq > $M.wdx

        awk ' NR==1      { alpha = $2; next }
        $2 != alpha      { print alpha " " $2-alpha | "sort" }
                         { alpha = $2 }' < /tmp/$$ > $M.mdx

        rm /tmp/$$
fi

rs=/tmp/$$
jn=/tmp/$$j
tp=/tmp/$$t

lk() { look -f "$1 " $M.wdx | cut -d' ' -f2 }

if [ x"$1" = x ]; then
        exit
fi

lk $1 > $rs
shift; for i in $*; do
        case $i in
        -*) lk ${i#-} > $jn; jarg=-v1 ;;
        *)  lk $i     > $jn; jarg=    ;;
        esac
        join $jarg $rs $jn > $tp
        mv $tp $rs
done

while read m; do
        look $m $M.mdx
done < $rs | sort -nr |\
while read off len; do
        tail -c +$off < $M | head -c $len
done

rm -f $rs $jn

Re: full-text search of email

Reply via email to