OK, I have the beginnings of the new word database code on my drive. I haven't updated htmerge or htsearch yet, so I'm not going to commit it to the tree just yet. Hopefully I'll have time to do that tomorrow.

The key benefits of this code are that no sorting is needed, every word of every document is indexed with location (for phrase searching), and the databases don't require a separate merge phase to prepare them for searching. (Hopefully you could dig on live databases, but without the updated htsearch, I can't really test that ;-)

Here are the stats on the database sizes for indexing the first 100 pages of www.htdig.org. I don't have times, but the 3.2 prototype feels significantly slower. I hope that's just the difference between compiling with -g and -O3, but I'll take a look for performance problems tomorrow...

Digging (and merging) with 3.1.2:

-rw-rw-r--   1 ghutchis ghutchis  1591296 Jul 11 00:41 db.docdb
-rw-rw-r--   1 ghutchis ghutchis     8192 Jul 11 00:41 db.docs.index
-rw-rw-r--   1 ghutchis ghutchis   846477 Jul 11 00:41 db.wordlist
-rw-rw-r--   1 ghutchis ghutchis  1052672 Jul 11 00:41 db.words.db

Total (K): 3436
Total w/o wordlist (K): 2604

Digging with 3.2 prototype:

-rw-rw-r--   1 ghutchis ghutchis   687104 Jul 11 00:39 db.docdb
-rw-rw-r--   1 ghutchis ghutchis   328704 Jul 11 00:39 db.docs.index
-rw-rw-r--   1 ghutchis ghutchis   583680 Jul 11 00:39 db.excerpts
-rw-rw-r--   1 ghutchis ghutchis   394240 Jul 11 00:39 db.words.db

Total (K): 1777
(deleting db.docs.index is possible, but not a big savings)

I'm rather surprised by this. I thought that storing every word would bloat the word db... Instead, it's about 40% the size of the original! I'm hoping I don't have a blatant bug, but my guess is that the database can compress the separate words more efficiently since each record is shorter (remember, the previous version used a list of document ID/weights as each record).

I hope to wrap this up quickly so we can start hammering on it and looking for performance problems.
If this isn't the right direction, we need to decide that soon.

-Geoff
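
For anyone who wants to picture the record layout described above, here is a minimal sketch in Python. It is not the actual ht://Dig code: the function names, the in-memory list standing in for db.words.db, and the whitespace tokenizer are all assumptions made for illustration. It only shows the idea of storing one short record per word occurrence, keyed by word, document ID, and location, and how those locations make phrase searching possible.

```python
from collections import defaultdict


def index_per_occurrence(docs):
    """Build one short record per word occurrence.

    Hypothetical stand-in for the prototype's word database: each record is
    (word, doc_id, location), rather than one long per-word record holding a
    list of document ID/weight pairs.
    """
    records = []
    for doc_id, text in docs.items():
        # Assumption: naive lowercase/whitespace tokenizing, for illustration.
        for location, word in enumerate(text.lower().split()):
            records.append((word, doc_id, location))
    return records


def phrase_search(records, phrase):
    """Return the sorted IDs of documents containing the words consecutively."""
    words = phrase.lower().split()
    if not words:
        return []

    # word -> set of (doc_id, location) pairs where it occurs
    positions = defaultdict(set)
    for word, doc_id, location in records:
        positions[word].add((doc_id, location))

    hits = set()
    for doc_id, start in positions[words[0]]:
        # Every following word must appear at the next location in the same doc.
        if all((doc_id, start + i) in positions[w] for i, w in enumerate(words)):
            hits.add(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    docs = {
        1: "the quick brown fox",
        2: "brown the quick fox",
        3: "quick brown",
    }
    records = index_per_occurrence(docs)
    print(phrase_search(records, "quick brown"))  # prints [1, 3]
```

A real on-disk store would keep these records ordered by key, so all occurrences of a word sit next to each other and no separate sort or merge pass is needed; the sketch skips that and simply groups the records in memory at search time.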