Compatible with the previously posted Python programs, though it's still a bit fiddly to get them to work together. Much faster. May be enjoyable if you like optimization.
/* maildex: find words, index by mail message byte offset

This program computes an inverted index for a mailbox file, in the same
format produced by maildex.py, which I posted to kragen-hacks earlier in
March. But it does it much faster. On my 2.4GHz PIV, it can index my
575MB current mailbox in 130 seconds of wall-clock time, 70 seconds of
CPU time.

Recent improvements, with time cost on my 500MHz PIII (18 user seconds
on my old sample 50MB dataset):
- skip words >20 chars (16 user seconds)
- move word storage into the wordentry struct (15 user seconds)
- make words case-insensitive up front (still 15 user seconds; terms
  reduced only from 306_131 to 282_254)
- allocate memory stackwise (14 user seconds)
- do isalnum myself, restructure find_words to eliminate a boolean
  variable, change findfrom() to use a switch instead of if/elsif,
  remove indirect function calls (12 user seconds) (on big 200M sample:
  40 CPU seconds, 100-160 wall-clock seconds) (on smaller 50M sample:
  11.8-11.9 CPU seconds; CPU seconds seem more consistent than user
  seconds. I lost the old 50M sample.)
- do tolower myself in parallel with extracting words, eliminating an
  extra function (10.8 CPU seconds on the new 50M sample)
- simplify the hash function (10.4 CPU seconds on the new 50M sample)

I'm compiling with this command line with gcc 3.3.3:

    gcc -Wall -ftracer -O3 maildex.c -o maildex

Using -fprofile-arcs/-fbranch-probabilities shaves another 2% or so off
the run time with gcc 3.3.3. (If you want to do this, read the
documentation; you can't just add them to the command line and expect a
speedup.) To run the profiler, I have to use -fno-inline to get useful
data; apparently my standard compilation turns the program into four big
functions.

Now, the mailbox is not part of the working set after the program
finishes indexing it, and is only accessed sequentially while the
program is indexing it. So now this program can index mailboxes up to
about twice the size of RAM.
Detailed Notes on Optimizations
-------------------------------

TODO: It'd probably be good to expand the allocated memory automatically
instead of allocating all of it up front. It's fine as long as you don't
have more than 512MB of RAM, though. (And as long as you have at least
that much swap.)

TODO: I've noticed that, on files being explicitly read, Linux 2.4 is
sometimes kind of stupid about noticing that you're doing a sequential
read on a bunch of files. I got a big performance boost in a Python
program by setting the buffer size to 256K (instead of the default 4K),
because I lost less time in disk seeks. Similarly, when the kernel
faults stuff in for a page fault in an mmapped region, you have to wait
until the disk finishes seeking before you can run; this program often
runs 2-4 times as long as it should because, even though the disk is
capable of more read bandwidth than this program can use, the kernel is
stupidly choosing to read chunks that are too small. madvise() is
supposed to be able to solve this problem (with MADV_WILLNEED, or at
least MADV_SEQUENTIAL), but while MADV_SEQUENTIAL seemed to improve
performance somewhat, it didn't really solve the problem. I tried
"rolling my own madvise" by forking a child process that would access
the same pages I was about to, but without that pesky computation
interleaved, both in ascending and descending order. That dramatically
reduced performance instead of improving it. This is the program's
biggest performance problem right now: it spends 30%-75% of its time
waiting for the disk for no good reason.

TODO: It still seems like 10.4 seconds (~4800 kilobytes per second) is
awfully slow for a 500MHz CPU. Does it really need to take 104 clock
cycles per byte to index my email? gcc -fverbose-asm -S is helpful when
investigating this kind of thing. Maybe I should try cachegrind too.
(On the other hand, if it took 104 instructions per byte to make a
full-text index on a 4.77MHz IBM PC, then a 360K floppy would take about
30 seconds to index, so maybe that's not so bad.)

TODO: Maybe using a trie instead of a hash table (since we have to
produce sorted output anyhow) would improve things. Instrumenting the
list-walking loop at the top of add_posting_to_wordentry_list showed
that it ran 8_255_534 times on a 50MB dataset that called
add_posting_to_wordentry_list 6_734_351 times, yielding 282_254 terms
--- so the average chain length was 9 or so, but the average number of
list entries examined per call was much less than 2. Yay for
move-to-front. (But in the normal case the loop is only 6 instructions
long anyway.)

TODO: to compare the words, it's using repz cmpsb, which is pretty cool
as far as it goes, but don't all of our words start on 32-bit
boundaries? Maybe we could zero out the remainder and compare them a
32-bit cell at a time. It's also uselessly checking for a zero wordlen.

For a while, I thought it might be better to case-fold at output time,
so I'd only have to downcase 3 million characters or so instead of 50
million or so. Then I changed the program so it fetches the lowercase
versions of characters as a side effect of testing whether or not the
character is a word character, so there's no extra cost. (If I changed
to a faster way of distinguishing word characters from nonword
characters, it might not have this side effect, and I might have to
reconsider. I haven't found one. I tried a switch statement, and gcc
turned it into a series of conditional branches, on the theory that that
would be faster than an indirect jump. Maybe it would have been, but it
was still a lot slower than indexing into the wordchar[] array.
(gcc also didn't realize that it could perform common-subexpression
elimination on the call to the function containing the switch statement,
so I did the common-subexpression elimination myself by hand, but it was
still slower.))

Much of the time and space this program spends on my mailbox, it spends
indexing binary attachments. If I skipped the binary attachments, it
would work so much better and faster.

The hash function I started with was Michael Neumann's version of Peter
J. Weinberger's well-known hash function (usually known as "hashpjw").
Briefly, this function rotates the hash left by four bits each
iteration, then adds in the current character. This version had a couple
of strange features: if it discovered that the leftmost four bits,
before rotation, were zero, it avoided merging them in at the low end,
and it took pains to clear the upper four bits after the rotation. The
first seemed like a misguided attempt at optimization to me, and the
second just seemed like a waste of time. Simplifying the hash function
saved 0.4 CPU seconds on 50 megabytes of input text. I also changed the
hash function to count downward from the end of the word instead of up
from the beginning, which seems to have saved about 0.1 CPU seconds on
the same amount of input text, but that's about a 1% improvement ---
within my measurement error.

On Memory Usage, with Special Attention to the Worst Case
---------------------------------------------------------

This program, at present, allocates memory only when it's adding
postings to its postings hash. A posting is 8 bytes, and most of the
time we add only one of those, or nothing, when we encounter a new
posting. But some of the time we add a wordentry, which is presently 20
bytes. The program allocates a bunch of these structures, then it
twiddles them around for a while to sort lists for output, and then it
outputs stuff. When I was using malloc, the malloc overhead on all these
8-byte objects was killing me.
(mallocing 8 bytes 4 million times consumes 64 megs of RAM on my
machine. So the best case for malloc overhead is one posting per word,
with 16 bytes of malloc overhead and 28 bytes of user data. Ick.)

Typical empirical numbers, back when I was using malloc, follow:

    50_000_000 bytes of email
    5560       messages
    4_416_563  words in them (according to wc)
    411_168    different terms
    2_826_904  postings + search terms, thus 2_415_736 postings
    48M        peak memory use (+ 48MB mmapped file)
    22         seconds run-time

We'd expect 2_415_736 * 16 + 411_168 * 28 = 50.1 megabytes; 22.6MB of
this was malloc overhead. So now this program allocates a single large
area to start with, and allocates stuff inside of it stack-wise by
nudging a pointer. It does all the work inside there, and then
deallocates it all at once at program exit. This halved the amount of
dirty data.

When there was no limit on word length, an index file containing only
search terms and newlines was 7_438_531 bytes, so an average of 17
characters per term (!). The 306_131 terms of 20 characters or less
averaged 8.8 chars each; the 343_840 terms of 30 characters or less
averaged 11.5 chars each. So, once we excluded terms of more than 20
characters, storing our strings in our wordentry structure only cost us
an extra 4-8 bytes, on average. We stored them at the end, with a count,
obviating the char *word inside the structure. This allowed us to
eliminate the actual document from our working set when we're sorting
and writing out the wordlist, and made it possible to shrink the posting
lists significantly by making things case-insensitive up front instead
of, say, at output time. */

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <ctype.h>
#include <string.h>

#define MAXWORDLEN 20
#define min(a, b) ((a) < (b) ?
(a) : (b))
#define align(val, size) (((int)(val) + ((size) - 1)) & ~((size)-1))

/* file stuff */

int open_or_die(char *fname, int *length)
{
    int rv = open(fname, O_RDONLY);
    struct stat st;
    if (rv < 0) { perror(fname); exit(1); }
    fstat(rv, &st);
    *length = st.st_size;
    return rv;
}

/* word indexing stuff */

typedef struct _posting {
    struct _posting *next;
    int doc_id;
} posting;

typedef struct _wordentry {
    struct _wordentry *next;
    int wordlen;
    int last_doc_found;
    posting *postings;
    char word[0];
} wordentry;

char *arena = 0;
char *arena_ptr = 0;

void *allocate(size_t size)
{
    char *rv = arena_ptr;
    arena_ptr += align(size, 4);
    return rv;
}

void add_posting_to_wordentry(wordentry *wep, int doc_id)
{
    if (doc_id != wep->last_doc_found) {
        posting *new = allocate(sizeof(*new));
        new->doc_id = doc_id;
        new->next = wep->postings;
        wep->postings = new;
        wep->last_doc_found = doc_id;
    }
}

/* hot spot: takes 24% of program's run time */
void add_posting_to_wordentry_list(wordentry **list, char *word,
                                   int wordlen, int doc_id)
{
    wordentry **wepp;
    wordentry *new;
    for (wepp = list; *wepp; wepp = &((*wepp)->next)) {
        if ((*wepp)->wordlen == wordlen &&
            !memcmp((*wepp)->word, word, wordlen))
            break;
    }
    new = *wepp;
    if (!new) {
        new = allocate(sizeof(*new) + wordlen);
        memcpy(new->word, word, wordlen);
        new->wordlen = wordlen;
        new->last_doc_found = -1;  /* invalid doc_id */
        new->postings = NULL;
    }
    add_posting_to_wordentry(new, doc_id);
    /* delete from old position in list */
    if (*wepp) *wepp = (*wepp)->next;
    /* move to front */
    new->next = *list;
    *list = new;
}

void dump_wordentry_list(wordentry *list)
{
    posting *p;
    for (; list; list = list->next) {
        fwrite(list->word, list->wordlen, 1, stdout);
        printf(": ");
        for (p = list->postings; p; p = p->next) {
            printf("%d", p->doc_id);
            if (p->next) printf(" ");
        }
        printf("\n");
    }
}

void test_word_indexing(void)
{
    wordentry *list = NULL;
    char *words[] = { "fuzzy", "bunny", "bin", "packing",
                      "bunny", "bunny", "bin", 0 };
    char **wp;
    for (wp = words; *wp; wp++) {
        add_posting_to_wordentry_list(&list, *wp, strlen(*wp), wp - words);
        add_posting_to_wordentry_list(&list, *wp, strlen(*wp), wp - words);
    }
    dump_wordentry_list(list);
}

/* hash table */

/* Since we have move-to-front, this size is only about 10%-15% better
 * than 1021 for a sample corpus with 14 million words, 112K distinct
 * words, and 6M or so postings. */
#define HASHSIZE 32749
wordentry *index_hash[HASHSIZE];

/* thanks P. J. Weinberger and Michael Neumann. ick, this takes 12% of
 * the program's run time, at 120 ns per call (according to gprof). */
int hash_word(char *s, int len)
{
    int i;
    unsigned h = 0;
    for (i = len; i;) {
        i--;
        h = ((h << 4) ^ ((h & 0xf0000000) >> 24)) + s[i];
    }
    return h % HASHSIZE;
}

/* dumb approach, used to be 10% of whole indexing process, unless my
 * profiler lies, which it might. Dead code; see below for current
 * version. (Note <=, not <: a match ending exactly at the end of the
 * haystack is still a match.) */
char *memstr(char *haystack, char *needle, int haystack_size)
{
    char *t;
    char *lim = haystack + haystack_size - strlen(needle);
    int i;
    for (t = haystack; t <= lim; t++) {
        for (i = 0; ; i++) {
            if (needle[i] == '\0') return t;
            if (needle[i] != t[i]) break;
        }
    }
    return NULL;
}

/* really "smart" approach. Boyer-Moore string search for "\nFrom ",
 * hand-compiled to C. About twice as fast as "dumb approach" above.
 * This makes the whole program about 5% faster, according to
 * gprof. Maybe that wasn't so smart after all. */
char *findfrom(char *haystack, int haystack_size)
{
    char *s;
    char *t;
    if (haystack_size < 6) return NULL;  /* too short to hold "\nFrom " */
    t = s = haystack + 5;
    for (;;) {
        if (*t != ' ') goto advance;
        t--;
        if (*t != 'm') goto advance;
        t--;
        if (*t != 'o') goto advance;
        t--;
        if (*t != 'r') goto advance;
        t--;
        if (*t != 'F') goto advance;
        t--;
        if (*t == '\n') return t;
    advance:
        /* we know there's no match ending at or before 's', and the
         * characters after 't' look plausible.
         * Given this particular search string, there are two
         * possibilities: either t<s, so the next possible match starts
         * at s+1 and runs to s+6, or (the usual case) t==s, and we need
         * to be alert to the opportunity that we're in the middle of a
         * match now. */
        /* post switch: 12.12, 12.28, 12.24.
         * pre switch (if/elsif/else): 12.24, 12.33, 12.25. No noticeable
         * speed difference, but this is clearer. */
        if (t == s) {
            switch (*t) {
            case 'm':  s += 1; break;
            case 'o':  s += 2; break;
            case 'r':  s += 3; break;
            case 'F':  s += 4; break;
            case '\n': s += 5; break;
            default:   s += 6; break;
            }
        } else s += 6;
        t = s;
        /* >=, not >: t must stay inside the buffer to be dereferenced */
        if (t >= haystack + haystack_size) return 0;
    }
}

/* a substitute for isalnum() and tolower(); using this for isalnum()
 * reduced user CPU time from 13.6 to 12.3 user seconds on the old
 * 50MB test data set, and using it for tolower() reduced total CPU
 * time from 11.3 to 10.77 seconds on the new 50MB data set. */
char wordchar[] = {
    "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"
    "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"  /* control chars */
    "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0"  /* space through / */
    "0123" "4567" "89\0\0" "\0\0\0\0"            /* digits, :;<=>?
*/ "\0abc" "defg" "hijk" "lmno" /* capitals */ "pqrs" "tuvw" "xyz\0" "\0\0\0\0" "\0abc" "defg" "hijk" "lmno" /* lower case */ "pqrs" "tuvw" "xyz\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" "\0\0\0\0" }; int current_doc_id; /* hot spot: takes up 18% of program's CPU time */ void find_words(char *s, int len) { char *t = s; static char word[MAXWORDLEN]; int i; for (;;) { skip_nonword_characters: for (;;) { if (t >= s + len) return; if (wordchar[(unsigned char)*t]) break; t++; } /* scan word */ i = 0; while (t < s + len && wordchar[(unsigned char)*t]) { if (i == MAXWORDLEN) { while (t < s + len && wordchar[(unsigned char)*t]) t++; goto skip_nonword_characters; } word[i] = wordchar[(unsigned char)*t]; i++; t++; } add_posting_to_wordentry_list(index_hash + hash_word(word, i), word, i, current_doc_id); } } void index_by_mail_messages(char *s, int len) { char *new_msg_start; /* in this function, this is the offset of the message start */ current_doc_id = 0; for (;;) { new_msg_start = findfrom(s + current_doc_id, len - current_doc_id); if (!new_msg_start) new_msg_start = s + len; find_words(s + current_doc_id, new_msg_start - s - current_doc_id); if (new_msg_start == s + len) break; current_doc_id = new_msg_start - s + 1; } } /* sorting the output by term */ void push(wordentry **stack, wordentry *item) { item->next = *stack; *stack = item; } wordentry *pop(wordentry **stack) { wordentry *rv = *stack; *stack = rv->next; /* should probably rv->next = NULL; */ return rv; } wordentry *nconc(wordentry *shortlist, wordentry *longlist) { wordentry **join; if (!shortlist) return longlist; join = &(shortlist->next); while (*join) join = &((*join)->next); *join = longlist; 
    return shortlist;
}

wordentry *all_entries()
{
    int i;
    wordentry *rv = NULL;
    for (i = 0; i < HASHSIZE; i++) {
        rv = nconc(index_hash[i], rv);
        index_hash[i] = 0;
    }
    return rv;
}

/* sort linked lists by moving entries to different lists. No
 * robustness against quicksort's O(N^2) worst case, because the input
 * data arrives ordered by hash value, then by recentness of occurrence,
 * so it is very unlikely to be nearly sorted already. Hotspot; 14% of
 * run time is here. */
wordentry *quicksort_wordentries(wordentry *list)
{
    wordentry *first, *before = NULL, *after = NULL, *wep;
    int cmp;
    if (!list) return list;
    first = pop(&list);
    while (list) {
        wep = pop(&list);
        cmp = memcmp(wep->word, first->word,
                     min(wep->wordlen, first->wordlen));
        if (!cmp && wep->wordlen < first->wordlen) cmp = -1;
        push((cmp < 0 ? &before : &after), wep);
    }
    first->next = quicksort_wordentries(after);
    return nconc(quicksort_wordentries(before), first);
}

/* main */

int main(int argc, char **argv)
{
    int length;
    int fh;
    char *mm;
    if (argc < 2) {
        fprintf(stderr, "usage: %s mailbox\n", argv[0]);
        return 1;
    }
    fh = open_or_die(argv[1], &length);
    mm = mmap(0, length, PROT_READ, MAP_SHARED, fh, 0);
    if (mm == MAP_FAILED) { perror("mmap"); return 2; }
    arena_ptr = arena = malloc(512 * 1024 * 1024);
    if (!arena) { perror("malloc"); return 3; }
    index_by_mail_messages(mm, length);
    munmap(mm, length);
    dump_wordentry_list(quicksort_wordentries(all_entries()));
    return 0;
}