At 5:16 PM -0400 6/7/01, Marcio Marchini wrote:
> For instance, is it keyed by word and the associated value is a list
>of document pointers ? Do you use numbers to identify these doc.
>pointers ? Do you do any compression of the list, like just storing
>the delta value from the previous number/pointer, and then using
>variable-bit encoding to represent these deltas ? etc.
Unfortunately, Berkeley DB (in particular its B-Tree) is a bit
tricky to use for this. Obviously you'd like to keep the value as a
list of location pointers (word position and doc ID) and compress
them essentially as you describe. But if you're building the database
on the fly, you take a horrible speed hit, because every new document
forces you to replace existing lists with longer ones.
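
To make the "list of pointers, delta-compressed" idea concrete, here
is a rough sketch. It is not ht://Dig code: the function names are
made up for illustration, it tracks doc IDs only (no word positions),
and it uses a variable-byte encoding (7 bits per byte) as a simple
stand-in for the variable-bit encoding you mention.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first.

    The high bit of each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("varint cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)


def compress_postings(doc_ids):
    """Sort the doc IDs, then store each as a varint delta from the previous."""
    out = bytearray()
    prev = 0
    for doc_id in sorted(doc_ids):
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decompress_postings(data):
    """Inverse of compress_postings: rebuild the sorted doc ID list."""
    doc_ids = []
    current = 0  # running sum of deltas
    value = 0    # delta being assembled
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += value
            doc_ids.append(current)
            value = 0
            shift = 0
    return doc_ids
```

With a layout like this, adding a document to a word that is already
indexed means fetching the stored value and writing back a longer
one, which is exactly the replace cost described above.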
So the htword/mifluz code uses a less conventional approach: store
all the words as separate keys (ugh!) but compress the keys based on
their shared prefix. So if we have a node "htdig" and then
"htdig.org" under it, the latter can be stored as something like
"#.org".
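
A minimal sketch of prefix compression over sorted keys follows. This
is generic front coding, not the actual mifluz on-disk format: instead
of a "#" marker it stores the number of characters shared with the
previous key plus the remaining suffix, and the function names are
hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_keys):
    """Turn sorted keys into (shared_prefix_length, suffix) pairs."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        n = shared_prefix_len(prev, key)
        encoded.append((n, key[n:]))
        prev = key
    return encoded


def front_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = ""
    for n, suffix in encoded:
        key = prev[:n] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["htdig", "htdig.org", "html"]
print(front_encode(keys))
# [(0, 'htdig'), (5, '.org'), (2, 'ml')]
```

Because a B-Tree keeps its keys sorted, neighbouring keys tend to
share long prefixes, which is what makes this kind of scheme pay off.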
There are some additional tricks thrown in to keep the level of
branching manageable, etc.
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
FAQ: http://htdig.sourceforge.net/FAQ.html