fast mbox inverted indexing in C

Kragen Sitaker Wed, 14 Apr 2004 00:38:15 -0700

Compatible with previously-posted Python programs, but it's a bit
fiddly to get them to work together still.  Much faster.  May be
enjoyable if you like optimization.


/*
maildex: find words, index by mail message byte offset

This program computes an inverted index for a mailbox file, in the
same format produced by maildex.py, which I posted to kragen-hacks
earlier in March.  But it does it much faster.  On my 2.4GHz PIV, it
can index my 575MB current mailbox in 130 seconds of wallclock time,
70 seconds of CPU time.

Recent improvements, with time cost on my 500MHz PIII:
  (18 user seconds on my old sample 50MB dataset)
- skip words >20chars
  (16 user seconds)
- move word storage into wordentry struct
  (15 user seconds)
- make words case-insensitive up front
  (15 user seconds still; terms reduced only from 306_131 to 282_254)
- allocate memory stackwise
  (14 user seconds)
- do isalnum myself, restructure find_words to eliminate boolean
  variable, change findfrom() to use a switch instead of if/elsif,
  remove indirect function calls
  (12 user seconds)
  (on big 200M sample: 40 cpu seconds, 100-160 wallclock seconds)
  (on smaller 50M sample: 11.8-11.9 CPU seconds; CPU seconds seem more
  consistent than user seconds.  I lost the old 50M sample.)
- do tolower myself in parallel with extracting words, eliminate extra
  function
  (10.8 CPU seconds on new 50M sample)
- simplify hash function
  (10.4 CPU seconds on new 50M sample)

I'm compiling with this command line with gcc 3.3.3:
gcc -Wall -ftracer -O3 maildex.c -o maildex

Using -fprofile-arcs/-fbranch-probabilities shaves another 2% or so
off the run-time with gcc 3.3.3.  (If you want to do this, read the
documentation; you can't just add them to the commandline and expect a
speedup.)

To run the profiler, I have to use -fno-inline to get useful data.
Apparently my standard compilation turns the program into four big
functions.

Now, the mailbox is not part of the working set after the program
finishes indexing it, and is only accessed sequentially while the
program is indexing it.  So now this program can index mailboxes up
to about twice the size of RAM.

Detailed Notes on Optimizations
-------------------------------

TODO: It'd probably be good to automatically expand the allocated
memory instead of allocating all of it up front.  It's fine as long as
you don't have more than 512MB of RAM though.  (And as long as you
have at least that much swap.)

TODO: I've noticed that, on files being explicitly read, Linux 2.4
is sometimes kind of stupid about noticing that you're doing a
sequential read on a bunch of files.  I got a big performance boost
on a Python program by setting the buffer size to 256K (instead of
the default 4K) because I lost less time in disk seeks.  Similarly,
when the kernel faults stuff in for a page fault in an mmapped
region, you have to wait until the disk finishes seeking before you
can run; this program often runs 2-4 times as long as it should
because, even though the disk is capable of more read bandwidth
than this program can use, the kernel is stupidly choosing to read
chunks that are too small.  madvise() is supposed to be able to
solve this problem (with MADV_WILLNEED or at least
MADV_SEQUENTIAL), but while MADV_SEQUENTIAL seemed to improve
performance somewhat, it didn't really solve the problem.

I tried "rolling my own madvise" by forking a child process that
would access the same pages I was about to, but without that pesky
computation interleaved, both in ascending and descending order.
That dramatically reduced performance instead of improving it.

This is the program's biggest performance problem right now.  It
spends 30%-75% of its time waiting for the disk for no good reason.

TODO: It still seems like 10.4 seconds (~4800 kilobytes per second) is
awfully slow for a 500MHz CPU.  Does it really need to take 104 clock
cycles per byte to index my email?  gcc -fverbose-asm -S is helpful
when investigating this kind of thing.  Maybe I should try cachegrind
too.

(On the other hand, if it took 104 instructions per byte to make a
full-text index on an 4.77MHz IBM PC, then a 360K floppy would take
about 30 seconds to index, so maybe that's not so bad.)

TODO: Maybe using a trie instead of a hash table (since we have to
produce sorted output anyhow) would improve things.

Instrumenting the list-walking loop at the top of
add_posting_to_wordentry_list showed that it ran 8_255_534 times on a
50MB dataset that called add_posting_to_wordentry_list 6_734_351
times, yielding 282_254 terms --- so average chain length was 9 or so,
but the average number of words examined per loop was much less than
2.  Yay for move-to-front.  (But in the normal case the loop is only 6
instructions long anyway.)

TODO: to compare the words, it's using repz cmpsb, which is pretty
cool as far as it goes, but don't all of our words start on 32-bit
boundaries?  Maybe we could zero out the remainder and compare them a
32-bit cell at a time.  It's also uselessly checking for a zero
wordlen.

For a while, I thought it might be better to case-fold at output time,
so I'd only have to downcase 3 million characters or so instead of 50
million or so.  Then I changed the program so it fetches the lowercase
versions of characters as a side effect of testing whether or not the
character is a word character, so there's no extra cost.  (If I
changed to a faster way of distinguishing word from nonword
characters, it might not have this side effect and I might have to
reconsider.  I haven't found one.  I tried a switch statement, and gcc
turned it into a series of conditional branches on the theory that it
would be faster than an indirect jump.  Maybe it would have been, but
it was still a lot slower than indexing into the wordchar[] array.
(gcc also didn't realize that it could perform common subexpression
elimination on the call to the function containing the switch
statement, so I did the common subexpression elimination myself by
hand, but it was still slower.))

Much of the time and space this program spends on my mailbox, it
spends indexing binary attachments.  If I skipped the binary
attachments it would work so much better and faster.

The hash function I started with was Michael Neumann's version of
Peter J. Weinberger's well-known hash function (usually known as
"hashpjw".)  Briefly, this function rotates the hash left by four bits
each iteration, then adds in the current character.  This version had
a couple of strange features: if it discovered that the leftmost four
bits, before rotation, were zero, it avoided merging them in at the
low end, and it took pains to clear the upper four bits after the
rotation.  The first seemed like a misguided attempt at optimization
to me, and the second just seemed like a waste of time.  Simplifying
the hash function saved 0.4 CPU seconds on 50 megabytes of input text.

I also changed the hash function to count downward from the end of the
word instead of up from the beginning, which seems to have saved about
0.1 CPU seconds on the same amount of input text, but that's about a
1% improvement --- within my measurement error.

On Memory Usage, with Special Attention to the Worst Case
---------------------------------------------------------

This program, at present, allocates memory only when it's adding
postings to its postings hash.  A posting is 8 bytes, and most of
the time we only add one of those, or nothing, when we encounter a
new posting.  But some of the time we add a wordentry, which is
presently 20 bytes.  The program allocates a bunch of these
structures, then it twiddles them around for a while to sort lists
for output, and then it outputs stuff.  When I was using malloc,
the malloc overhead on all these 8-byte objects was killing me.
(mallocing 8 bytes 4 million times consumes 64 megs of RAM on my
machine.  So the best case for malloc overhead is one posting per
word, with 16 bytes of malloc overhead and 28 bytes of user data.
Ick.)

Typical empirical numbers, back when I was using malloc, follow:
50_000_000 bytes of email
5560 messages
4_416_563 words in them (according to wc)
411_168 different terms
2_826_904 postings + search terms, thus
2_415_736 postings
48M peak memory use (+ 48MB mmapped file)
22 seconds run-time
We'd expect 2_415_73616 + 411_16828 = 50.1 megabytes
22.6MB of this was malloc overhead.

So now this program allocates a single large area to start with,
and allocates stuff inside of it stack-wise by nudging a pointer.
It does all the work inside there, and then deallocates it all at
once at program exit.  This halved the amount of dirty data.

When there was no limit on word length, an index file containing
only search terms and newlines is 7_438_531 bytes, so average 17
characters per term (!).  The 306_131 terms of 20 characters or
less averaged 8.8 chars each; the 343_840 terms of 30 characters or
less average 11.5 chars each.  So, once we excluded terms of 20
characters or more, storing our strings in our wordentry structure
only cost us an extra 4-8 bytes, on average.  We stored them at the
end, with a count, obviating the char *word inside the structure.
This allowed us to eliminate the actual document from our working
set when we're sorting and writing out the wordlist, and made it
possible to shrink the posting lists significantly by making things
case-insensitive up front instead of, say, at output time.

 */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <ctype.h>
#include <string.h>

#define MAXWORDLEN 20

#define min(a, b) ((a) < (b) ? (a) : (b))
#define align(val, size) (((int)(val) + ((size) - 1)) & ~((size)-1))


/* file stuff */

int open_or_die(char *fname, int *length) {
  int rv = open(fname, O_RDONLY);
  struct stat st;
  if (rv < 0) {
    perror(fname);
    exit(1);
  }
  fstat(rv, &st);
  *length = st.st_size;
  return rv;
}

/* word indexing stuff */

typedef struct _posting {
  struct _posting *next;
  int doc_id;
} posting;

typedef struct _wordentry {
  struct _wordentry *next;
  int wordlen;
  int last_doc_found;
  posting *postings;
  char word[0];
} wordentry;

char *arena = 0;
char *arena_ptr = 0;
void *allocate(size_t size) {
  char *rv = arena_ptr;
  arena_ptr += align(size, 4);
  return rv;
}

void add_posting_to_wordentry(wordentry *wep, int doc_id) {
  if (doc_id != wep->last_doc_found) {
    posting *new = allocate(sizeof(*new));
    new->doc_id = doc_id;
    new->next = wep->postings;
    wep->postings = new;
    wep->last_doc_found = doc_id;
  }
}

/* hot spot: takes 24% of program's run time */
void add_posting_to_wordentry_list(wordentry **list, char *word, int wordlen, 
                                   int doc_id) {
  wordentry **wepp;
  wordentry *new;

  for (wepp = list; *wepp; wepp = &((*wepp)->next)) {
    if ((*wepp)->wordlen == wordlen && !memcmp((*wepp)->word, word, wordlen)) 
      break;
  }

  new = *wepp;
  if (!new) {
    new = allocate(sizeof(*new)+wordlen);
    memcpy(new->word, word, wordlen);
    new->wordlen = wordlen;
    new->last_doc_found = -1; /* invalid doc_id */
    new->postings = NULL;
  }

  add_posting_to_wordentry(new, doc_id);
  
  /* delete from old position in list */
  if (*wepp) *wepp = (*wepp)->next;
  /* move to front */
  new->next = *list;
  *list = new;
}

void dump_wordentry_list(wordentry *list) {
  posting *p;
  for (; list; list = list->next) {
    fwrite(list->word, list->wordlen, 1, stdout);
    printf(": ");
    for (p = list->postings; p; p = p->next) {
      printf("%d", p->doc_id);
      if (p->next) printf(" ");
    }
    printf("\n");
  }
}

void test_word_indexing(void) {
  wordentry *list = NULL;
  char *words[] = { "fuzzy", "bunny", "bin", "packing", "bunny", "bunny", "bin", 0 };
  char **wp;
  for (wp = words; *wp; wp++) {
    add_posting_to_wordentry_list(&list, *wp, strlen(*wp), wp-words);
    add_posting_to_wordentry_list(&list, *wp, strlen(*wp), wp-words);
  }
  dump_wordentry_list(list);
}

/* hash table */

/* Since we have move-to-front, this size is only about 10%-15% better
 * than 1021 for a sample corpus with 14 million words, 112K distinct
 * words, and 6M or so postings.  */
#define HASHSIZE 32749
wordentry *index_hash[HASHSIZE];

/* thanks P. J. Weinberger and Michael Neumann. ick, this takes 12% of
   the program's run time, at 120 ns per call (according to gprof). */
int hash_word(char *s, int len) {
  int i;
  unsigned h = 0;
  for (i = len; i;) {
    i--;
    h = ((h << 4) ^ ((h&0xf0000000) >> 24)) + s[i];
  }
  return h % HASHSIZE;
}

/* dumb approach, used to be 10% of whole indexing process, unless my
 * profiler lies, which it might.  Dead code; see below for current
 * version.  */
char *memstr(char *haystack, char *needle, int haystack_size) {
  char *t;
  char *lim = haystack + haystack_size - strlen(needle);
  int i;
  for (t = haystack; t < lim; t++) {
    for (i = 0; ; i++) {
      if (needle[i] == '\0') return t;
      if (needle[i] != t[i]) break;
    }
  }
  return NULL;
}

/* really "smart" approach.  Boyer-Moore string search for "\nFrom ",
 * hand-compiled to C.  About twice as fast as "dumb approach" above.
 * This makes the whole program about 5% faster, according to
 * gprof.  Maybe that wasn't so smart after all. */
char *findfrom(char *haystack, int haystack_size) {
  char *s;
  char *t = s = haystack + 5;
  for (;;) {
    if (*t != ' ') goto advance; t--;
    if (*t != 'm') goto advance; t--;
    if (*t != 'o') goto advance; t--;
    if (*t != 'r') goto advance; t--;
    if (*t != 'F') goto advance; t--;
    if (*t == '\n') return t;
  advance:
    /* we know there's no match ending at or before 's', and the
     * characters after 't' look plausible.  Given this particular search
     * string, there are two possibilities: either t<s, so the next
     * possible match starts at s+1 and runs to s+6, or (the usual case)
     * t==s, and we need to be alert to the opportunity that we're in the
     * middle of a match now.
     */
    /* post switch: 12.12, 12.28, 12.24.
     * pre switch (if/elsif/else): 12.24, 12.33, 12.25.  No noticeable
     * speed difference, but this is clearer. */
    if (t==s) {
      switch(*t) {
      case 'm': s += 1; break;
      case 'o': s += 2; break;
      case 'r': s += 3; break;
      case 'F': s += 4; break;
      case '\n': s += 5; break;
      default: s += 6; break;
      }
    }
    else s += 6;
    t = s;
    if (t > haystack + haystack_size) return 0;
  }
}

/* a substitute for isalnum() and tolower(); using this for isalnum()
 * reduced user CPU time from 13.6 to 12.3 user seconds on the old
 * 50MB test data set, and using it for tolower() reduced total CPU
 * time from 11.3 to 10.77 seconds on the new 50MB data set. */
char wordchar[] = {
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" /* control chars */
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" /* space through / */
  "0123"     "4567"     "89\0\0"   "\0\0\0\0" /* digits, :;<=>? */

  "\0abc"    "defg"     "hijk"     "lmno"     /* capitals */
  "pqrs"     "tuvw"     "xyz\0"    "\0\0\0\0"
  "\0abc"    "defg"     "hijk"     "lmno"     /* lower case */
  "pqrs"     "tuvw"     "xyz\0"    "\0\0\0\0"

  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" 
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" 

  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" 
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
  "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" 
};

int current_doc_id;

/* hot spot: takes up 18% of program's CPU time */
void find_words(char *s, int len) {
  char *t = s;
  static char word[MAXWORDLEN];
  int i;

  for (;;) {
  skip_nonword_characters:
    for (;;) {
      if (t >= s + len) return;
      if (wordchar[(unsigned char)*t]) break;
      t++;
    }

    /* scan word */
    i = 0;
    while (t < s + len && wordchar[(unsigned char)*t]) {
      if (i == MAXWORDLEN) {
        while (t < s + len && wordchar[(unsigned char)*t]) t++;
        goto skip_nonword_characters;
      } 
      word[i] = wordchar[(unsigned char)*t];
      i++;
      t++;
    }

    add_posting_to_wordentry_list(index_hash + hash_word(word, i),
                                  word, i, current_doc_id);
  }
}

void index_by_mail_messages(char *s, int len) {
  char *new_msg_start;

  /* in this function, this is the offset of the message start */
  current_doc_id = 0;
  for (;;) {
    new_msg_start = findfrom(s + current_doc_id, len - current_doc_id);
    if (!new_msg_start) new_msg_start = s + len;

    find_words(s + current_doc_id, new_msg_start - s - current_doc_id);

    if (new_msg_start == s + len) break;

    current_doc_id = new_msg_start - s + 1;
  }
}

/* sorting the output by term */

void push(wordentry **stack, wordentry *item) {
  item->next = *stack;
  *stack = item;
}

wordentry *pop(wordentry **stack) {
  wordentry *rv = *stack;
  *stack = rv->next;
  /* should probably rv->next = NULL; */
  return rv;
}

wordentry *nconc(wordentry *shortlist, wordentry *longlist) {
  wordentry **join;
  if (!shortlist) return longlist;
  join = &(shortlist->next);
  while (*join) join = &((*join)->next);
  *join = longlist;
  return shortlist;
}

wordentry *all_entries() {
  int i;
  wordentry *rv = NULL;
  for (i = 0; i < HASHSIZE; i++) {
    rv = nconc(index_hash[i], rv);
    index_hash[i] = 0;
  }
  return rv;
}

/* sort linked lists by moving entries to different lists.  No
 * robustness against quicksort's O(N^2) worst case, because the input
 * data is sorted by hash value, then by recentness of occurrence, so
 * is very unlikely to be nearly sorted already.  Hotspot; 14% of run
 * time is here. */
wordentry *quicksort_wordentries(wordentry *list) {
  wordentry *first, *before = NULL, *after = NULL, *wep;
  int cmp;
  if (!list) return list;
  first = pop(&list);
  while (list) {
    wep = pop(&list);
    cmp = memcmp(wep->word, first->word, min(wep->wordlen, first->wordlen));
    if (!cmp && wep->wordlen < first->wordlen) cmp = -1;
    push((cmp < 0 ? &before : &after), wep);
  }
  first->next = quicksort_wordentries(after);
  return nconc(quicksort_wordentries(before), first);
}

/* main */

int main(int argc, char **argv) {
  int length;
  int fh = open_or_die(argv[1], &length);
  char *mm = mmap(0, length, PROT_READ, MAP_SHARED, fh, 0);

  if (mm == MAP_FAILED) {
    perror("mmap");
    return 2;
  }

  arena_ptr = arena = malloc(512 * 1024 * 1024);

  if (!arena) {
    perror("malloc");
    return 3;
  }

  index_by_mail_messages(mm, length);
  munmap(mm, length);
  dump_wordentry_list(quicksort_wordentries(all_entries()));
  return 0;
}

fast mbox inverted indexing in C

Reply via email to