OK, I have the beginnings of the new word database code on my drive. I
haven't updated htmerge or htsearch yet, so I'm not going to commit it to
the tree until that's done. Hopefully I'll have time for that tomorrow.

The key benefits of this code are that no sorting is needed, every word of
every document is indexed with location (for phrase searching), and the
databases don't require a separate merge phase to prepare them for
searching. (Hopefully you could dig on live databases, but without the
updated htsearch, I can't really test that ;-)
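
To make the idea concrete, here is a rough sketch of an index that stores
one entry per word occurrence, with its location. It is only an
illustration in Python, not the prototype's actual code or on-disk format:
the PositionalIndex class, its method names, and the in-memory dict are all
assumptions made for the example. It shows why nothing has to be sorted or
merged before searching, and how locations allow phrase matching.

```python
from collections import defaultdict


class PositionalIndex:
    """Toy in-memory index: one (doc_id, position) entry per word occurrence.

    Hypothetical illustration only; not the ht://Dig schema.
    """

    def __init__(self):
        # word -> [(doc_id, position), ...]
        self.postings = defaultdict(list)

    def add_document(self, doc_id, text):
        # Each occurrence is appended as it is seen, so the index is
        # searchable right away with no separate sort or merge pass.
        for position, word in enumerate(text.lower().split()):
            self.postings[word].append((doc_id, position))

    def phrase_search(self, phrase):
        words = phrase.lower().split()
        if not words:
            return set()
        # Start from every occurrence of the first word...
        candidates = set(self.postings.get(words[0], []))
        # ...and keep only those followed by each later word at the
        # expected offset within the same document.
        for offset, word in enumerate(words[1:], start=1):
            following = set(self.postings.get(word, []))
            candidates = {
                (doc, pos) for (doc, pos) in candidates
                if (doc, pos + offset) in following
            }
        return {doc for (doc, _) in candidates}


index = PositionalIndex()
index.add_document(1, "the new word database code")
index.add_document(2, "word code for the new database")
print(index.phrase_search("new word"))  # prints {1}
```

Both documents contain "new" and "word", but only document 1 has them at
adjacent locations, which is the distinction a plain document ID/weight
list cannot make.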

Here are the stats on the database sizes for indexing the first 100 pages
of www.htdig.org. I don't have times, but the 3.2 prototype feels
significantly slower. I hope that's just the difference between compiling
with -g and -O3, but I'll take a look for performance problems tomorrow...

Digging (and merging) with 3.1.2:
-rw-rw-r--   1 ghutchis ghutchis  1591296 Jul 11 00:41 db.docdb
-rw-rw-r--   1 ghutchis ghutchis     8192 Jul 11 00:41 db.docs.index
-rw-rw-r--   1 ghutchis ghutchis   846477 Jul 11 00:41 db.wordlist
-rw-rw-r--   1 ghutchis ghutchis  1052672 Jul 11 00:41 db.words.db

Total (K): 3436
Total w/o wordlist (K): 2604


Digging with 3.2 prototype:
-rw-rw-r--   1 ghutchis ghutchis   687104 Jul 11 00:39 db.docdb
-rw-rw-r--   1 ghutchis ghutchis   328704 Jul 11 00:39 db.docs.index
-rw-rw-r--   1 ghutchis ghutchis   583680 Jul 11 00:39 db.excerpts
-rw-rw-r--   1 ghutchis ghutchis   394240 Jul 11 00:39 db.words.db

Total (K): 1777
(deleting db.docs.index is possible, but not a big savings)

I'm rather surprised by this. I thought that storing every word would
bloat the word db... Instead, it's under 40% of the size of the original
(394240 vs. 1052672 bytes)! I'm hoping I don't have a blatant bug, but my
guess is that the database can compress the separate words more
efficiently since each record is shorter (remember, the previous version
used a list of document IDs/weights as each record).
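
For anyone who wants to check the arithmetic, the ratio can be worked out
directly from the db.words.db sizes in the two listings above. The variable
names are just for this example.

```python
# Byte sizes copied from the two directory listings above.
old_words_db = 1052672  # db.words.db after digging and merging with 3.1.2
new_words_db = 394240   # db.words.db after digging with the 3.2 prototype

# Size of the new word database relative to the old one.
ratio = new_words_db / old_words_db
print(f"{ratio:.1%}")  # prints 37.5%
```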

I hope to wrap this up quickly so we can start hammering on it and looking
for performance problems. If this isn't the right direction, we need to
decide that soon.

-Geoff
