OK, I've been swamped. I didn't get a chance yet to try out my changes to
put the DocHead fields in a separate database. I hope to get to it soon.
But I've been working out how to redesign the word database. Here's a
current sketch.

db.words
 word -> WordRef: (DocID, Position, Flags, Anchor)

In English, there's a word ref for every word occurrence in every document.
Every *word*, not just every unique word like now. This will naturally make
a larger db if it stays uncompressed. Some of that space should come back
once the database compresses keys, which it currently does not.

DocID: Same as in db.docdb
Position: I'd suggest an unsigned 16-bit int first. This is the word
        position in the document. Used for phrase and near searching and the
        like.
Flags: Either 16 bits or 32 bits. See below.
Anchor: Same as current. If this word is after an anchor, store it.
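
To make the record layout concrete, here is a minimal sketch of one WordRef packed into a fixed-size value. Only the 16-bit Position comes from the proposal above. The 32-bit DocID, the 16-bit choice for Flags, the 16-bit Anchor number, the big-endian byte order and the names WordRef, pack_wordref and unpack_wordref are all assumptions for illustration, not the actual db format.

```python
import struct
from typing import NamedTuple

# Assumed layout, for illustration only: big-endian, no padding.
#   I = 32-bit unsigned DocID    (assumption)
#   H = 16-bit unsigned Position (as suggested in the proposal)
#   H = 16-bit unsigned Flags    (the proposal allows 16 or 32 bits)
#   H = 16-bit unsigned Anchor   (assumption)
WORDREF_FORMAT = ">IHHH"
WORDREF_SIZE = struct.calcsize(WORDREF_FORMAT)


class WordRef(NamedTuple):
    doc_id: int
    position: int
    flags: int
    anchor: int


def pack_wordref(ref: WordRef) -> bytes:
    """Serialize one word occurrence into a fixed-size record."""
    return struct.pack(WORDREF_FORMAT, *ref)


def unpack_wordref(data: bytes) -> WordRef:
    """Inverse of pack_wordref."""
    return WordRef(*struct.unpack(WORDREF_FORMAT, data))
```

With this layout an unsigned 16-bit Position cannot address anything past word 65535 in a document (struct.pack raises struct.error for larger values), so very long documents would need either a cap or a wider field.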

Flags. This is how I propose to deal with searches by field. As a bonus, it
gives us on-the-fly weighting factors. Put simply, you define a base set of
flags and leave the rest up to the user to define in the config file.
Base set: (flag #, then integer value)
 0     0    - plain text
 1     1    - title
 2     2    - header
 3     4    - keyword
 4     8    - description (e.g. META)
 5     16   - link description
 6     32   - author
 7     64   - subject
 8+    128+ - undefined
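
The base set above maps directly onto bit masks. The following is a rough sketch, with the integer values taken from the table; the names WordFlag and user_flag are hypothetical and only there for illustration.

```python
from enum import IntFlag


class WordFlag(IntFlag):
    """Base flag set, using the integer values from the table."""
    PLAIN = 0
    TITLE = 1
    HEADER = 2
    KEYWORD = 4
    DESCRIPTION = 8
    LINK_DESCRIPTION = 16
    AUTHOR = 32
    SUBJECT = 64


def user_flag(flag_number: int) -> int:
    """Bit value for flag number n >= 1, following the table's pattern.

    Flag 1 is 1, flag 2 is 2, flag 3 is 4, so flag n is 2**(n - 1).
    User-defined flags start at flag number 8 (value 128).
    """
    if flag_number < 1:
        raise ValueError("flag 0 is plain text and has no bit of its own")
    return 1 << (flag_number - 1)
```

A word that appears in both the title and the keywords would then carry WordFlag.TITLE | WordFlag.KEYWORD, and a field search is a single AND against the stored flags.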

Scoring then becomes a set of bit tests, summing the config-file defined
factors.
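
As a sketch of what that scoring could look like: the weights below are made-up numbers standing in for whatever the config file would define, and the separate weight for plain text (flag value 0, which has no bit to test) is my assumption, not part of the proposal.

```python
# Hypothetical weighting factors keyed by flag bit value. In the proposal
# these would be read from the config file; the numbers here are invented.
WEIGHTS = {
    1: 5.0,  # title
    2: 3.0,  # header
    4: 4.0,  # keyword
    8: 2.0,  # description
}

# Plain text has flag value 0, so it cannot be found by a bit test.
# Giving it its own weight is an assumption of this sketch.
PLAIN_TEXT_WEIGHT = 1.0


def score_occurrence(flags, weights=WEIGHTS, plain_weight=PLAIN_TEXT_WEIGHT):
    """Score one word occurrence by summing the factor of every set flag."""
    if flags == 0:
        return plain_weight
    return sum(factor for bit, factor in weights.items() if flags & bit)
```

Because the factors are applied at search time and not baked into the index, changing a weight in the config would not require reindexing.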

Another "bonus" of this approach is that the db.words file can be generated
while indexing, relegating htmerge to simply merging databases and allowing
for parallel indexing and searching. In fact, by removing the need for
.work files, we might "buy" enough disk space for the larger db!
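
To illustrate the "simply merging" part: if each indexing run emits its word refs already sorted, combining them is a plain k-way merge. This is only a sketch under that assumption. The function name merge_word_runs and the (word, doc_id, position) tuple shape are hypothetical, and this is not how htmerge works today.

```python
import heapq


def merge_word_runs(*runs):
    """Lazily merge sorted runs of (word, doc_id, position) tuples.

    Each run must already be sorted. The result is one sorted stream,
    produced without loading every run into memory at once.
    """
    return heapq.merge(*runs)
```

For example, two runs produced by separate indexers could be combined with list(merge_word_runs(run_a, run_b)), and the entries for the same word from different documents end up adjacent.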

-Geoff

