I wrote up some possible database layouts for ht://Dig. For each, I gave
some advantages and disadvantages. My hope is that we can agree on a good
layout and not have to worry about it. Since I'm not versed in SQL, I
assumed we'd be sticking with the db package.
Thoughts?
-Geoff
* Plan 0:
Continue current database layout with minor enhancements.
Advantages:
Mostly compatible with previous versions
Disadvantages:
Not particularly efficient
* Plan 1:
db.docdb (filed by DocID, B-tree)
db.docurl (url -> DocID, used during htdig, Hash)
db.docsig (signature -> DocID, used during htdig, Hash)
db.worddb (similar to current word index)
Advantages:
Fewer necessary databases (2)
No need for htmerge pass (code optimized to do all transactions in htdig)
Parallel indexing and searching (htdig ensures consistent databases)
Faster searches (no need for DocID -> URL lookup in search)
Faster digging (db.docurl gives faster URL lookups than the current layout)
Support for removing duplicates (db.docsig stores a signature, some sort of
checksum, for each document so that duplicates can be ignored)
Disadvantages:
No concrete phrase support
No concrete support for "fields" or searching attributes
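To make the Plan 1 lookups concrete, here is a rough sketch with plain Python dicts standing in for the db files. Everything beyond the three database names is an assumption for illustration: the class and method names are made up, SHA-256 is just one possible "signature", and mapping a duplicate's URL onto the first document's DocID is only one way duplicates could be handled.

```python
import hashlib


class Plan1Index:
    """In-memory stand-in for db.docdb, db.docurl and db.docsig (hypothetical)."""

    def __init__(self):
        self.docdb = {}   # DocID -> document record (B-tree in the proposal)
        self.docurl = {}  # URL -> DocID (hash in the proposal)
        self.docsig = {}  # signature -> DocID (hash in the proposal)
        self.next_id = 0

    def add(self, url, body):
        """Return (doc_id, is_new); duplicate content reuses the first DocID."""
        # A URL seen before is answered by one hash lookup.
        if url in self.docurl:
            return self.docurl[url], False
        # Assumption: the signature is a SHA-256 digest of the document body.
        sig = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if sig in self.docsig:
            doc_id = self.docsig[sig]
            self.docurl[url] = doc_id  # remember the duplicate's URL
            return doc_id, False
        doc_id = self.next_id
        self.next_id += 1
        # Keeping the URL in the docdb record means search needs no extra lookup.
        self.docdb[doc_id] = {"url": url, "length": len(body)}
        self.docurl[url] = doc_id
        self.docsig[sig] = doc_id
        return doc_id, True
```

With this, two URLs serving identical content end up with a single docdb record, and re-visiting a known URL never touches docdb at all.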
* Plan 2:
As plan 1, but db.worddb contains records for *every* word in a document,
including location.
Advantages:
Support for phrase, near, before, and after searching with locations
Disadvantages:
Probably much larger, though how *much* larger?
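A small sketch of what Plan 2 buys, assuming each db.worddb record is simply a (DocID, position) pair keyed by word. The function names and the whitespace tokenizer are made up for illustration and are not taken from the existing code.

```python
from collections import defaultdict


def build_worddb(docs):
    """docs maps DocID -> text; returns word -> list of (DocID, position)."""
    worddb = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            worddb[word].append((doc_id, pos))
    return worddb


def near(worddb, first, second, max_gap):
    """DocIDs where the two words occur within max_gap positions of each other."""
    hits = set()
    for doc_a, pos_a in worddb.get(first, []):
        for doc_b, pos_b in worddb.get(second, []):
            if doc_a == doc_b and 0 < abs(pos_a - pos_b) <= max_gap:
                hits.add(doc_a)
    return hits


def phrase(worddb, words):
    """DocIDs containing the given words consecutively and in order."""
    starts = set(worddb.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        here = set(worddb.get(word, []))
        # Keep only phrase starts whose next word sits at the expected position.
        starts = {(d, p) for d, p in starts if (d, p + offset) in here}
    return {d for d, _ in starts}
```

On the size question: in this sketch the number of records equals the total number of word occurrences, so the growth over an index that keeps one record per distinct word per document is roughly the average number of times a word repeats within a document.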
* Plan 3:
As plan 1, but db.worddb contains a record for *every* word, each carrying
the word ids of the words "before" and "after" it
Advantages:
Support for phrase, before, and after searching
Possibly smaller than plan 2
Disadvantages:
No support for "near" searching
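One possible reading of Plan 3, sketched below: every word occurrence gets a record holding the DocID plus the ids of its immediate neighbours. The record shape, the id assignment, and the function names are all assumptions for illustration.

```python
NO_WORD = 0  # assumed sentinel id meaning "no neighbour on this side"


def build_worddb(docs):
    """docs maps DocID -> text; returns (word_ids, worddb).

    worddb maps word id -> list of (DocID, before_id, after_id).
    """
    word_ids = {}
    worddb = {}
    for doc_id, text in docs.items():
        # Assign ids starting at 1 so that 0 can stay reserved for NO_WORD.
        ids = [word_ids.setdefault(w, len(word_ids) + 1)
               for w in text.lower().split()]
        for i, wid in enumerate(ids):
            before = ids[i - 1] if i > 0 else NO_WORD
            after = ids[i + 1] if i + 1 < len(ids) else NO_WORD
            worddb.setdefault(wid, []).append((doc_id, before, after))
    return word_ids, worddb


def adjacent(word_ids, worddb, first, second):
    """DocIDs where `first` is immediately followed by `second`."""
    a, b = word_ids.get(first), word_ids.get(second)
    if a is None or b is None:
        return set()
    return {doc for doc, _before, after in worddb[a] if after == b}
```

Because a record only knows its direct neighbours and not its position, there is nothing to measure a distance with, which is why "near" falls out of reach under this layout.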