[EMAIL PROTECTED] wrote:
> Ouch. A 40% increase is not as big as one might have expected. I guess
> it also depends on the maximum allowed word length and the nature of the
> indexed data. The average English word length being 4 characters, storing
> the word is just like storing an additional integer.
Actually, the Berkeley DB version 2 code is pretty good about this. In
the BTree, it does prefix compression so that duplicate keys require
very little extra storage. They also throw in pretty fine-grained
locking and database caching that make it fairly fast to write.
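(For the curious, the duplicate-key path is only about this much code
against the version 2 C API. This is a rough sketch from memory, so the
exact structure names and signatures may be slightly off, and the
function name and posting format are just for illustration; the point is
that each extra record is one plain put() under the same key, and the
Btree absorbs the repeated key cheaply.)

    #include <db.h>        /* Berkeley DB 2.x */
    #include <string.h>

    int store_duplicate(const char *word, const char *posting, size_t len)
    {
        DB      *db;
        DB_INFO  info;

        /* Ask for a Btree that allows duplicate keys. */
        memset(&info, 0, sizeof(info));
        info.flags = DB_DUP;

        if (db_open("db.words.db", DB_BTREE, DB_CREATE, 0644,
                    NULL, &info, &db) != 0)
            return -1;

        DBT key, data;
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data  = (void *) word;
        key.size  = strlen(word);
        data.data = (void *) posting;
        data.size = len;

        /* Each call adds one more record under the same word; prefix
           compression keeps the repeated key cheap on disk. */
        int ret = db->put(db, NULL, &key, &data, 0);
        db->close(db, 0);
        return ret;
    }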
> In addition I doubt very much that this organisation will allow us to
> efficiently handle 10 000 000 documents containing an average of 200
> words, that is, a B-Tree with 2 billion nodes.
True. However, the trade-off is that the overhead of adding a duplicate
key is fairly low compared with updating the posting list in place. It
was easy enough to generate the list using htmerge, but since we'd
*REALLY* like to generate the word index on the fly, we need a good
solution for this.
Let's say we want a word index where the data is a list of fine-grained
locations. I'm going to be abstract, but essentially the list specifies
where every occurrence of the word is, for phrase searching.
Now let's say we want to add a document. Obviously we'll need to go
through the key for each of its words and update that word's list. So
we'll need to do a database lookup and a retrieval to get the old list.
Then we probably need at least some decoding of the list itself
(hopefully not much). Then we add the new document's locations, possibly
re-encode the list, and ask the database to update the data.
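In rough code, the per-word cycle looks something like this (with a
plain std::map standing in for the Btree, and a made-up "docID:position"
encoding just to keep the steps concrete); every word in the new
document costs a get, a decode, an append, a re-encode and a put of the
whole, ever-growing list:

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Stand-in for the word Btree: word -> encoded posting list.
    // The "docID:position;docID:position" encoding is invented here.
    std::map<std::string, std::string> wordIndex;

    void add_document(int docID,
                      const std::vector<std::pair<std::string, int> > &words)
    {
        for (const auto &w : words) {
            // 1. lookup + retrieval of the old list
            std::string list = wordIndex[w.first];

            // 2. decode the old list here, if the format requires it

            // 3. append this document's location
            if (!list.empty())
                list += ";";
            list += std::to_string(docID) + ":" + std::to_string(w.second);

            // 4. re-encode and write the whole list back -- this is the
            //    expensive step, because the value keeps outgrowing the
            //    space it previously occupied
            wordIndex[w.first] = list;
        }
    }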
The problem is this last part... I don't pretend to know much about the
Berkeley internals, but several people (Andrew Scherpbier among them)
warned me that adding data to a pre-existing key would be slow. They
weren't kidding. It's fast enough to build the list from another file
(the db.wordlist -> db.words.db stuff we've been going through). But
updating it on the fly was painful--I think the database realizes the
new data won't fit in the hole left by the previous data, so it has to
expand the database, delete the old data, and put the new data in.
Thus the nature of the current WordList code. Store a duplicate key
because that requires less database overhead (for a while). I'm not
particularly happy with this solution either, but it seems like we're
stuck between a rock and a hard place. (I don't know how well that
translates--some people reminded me that American slang like 'yup'
doesn't.)
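For comparison, here is roughly what the duplicate-key side of the
trade-off looks like, again with a standard container standing in for
the Btree and an invented record format. Adding a document only ever
inserts small new records next to existing keys; the cost of stitching
the full posting list back together is pushed off to a later pass, which
is essentially what htmerge does.

    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Stand-in for a Btree opened with duplicate keys allowed: each
    // (word, record) pair is stored as its own small entry.
    std::multimap<std::string, std::string> wordIndex;

    // Adding a document never reads or rewrites existing records.
    void add_document(int docID,
                      const std::vector<std::pair<std::string, int> > &words)
    {
        for (const auto &w : words)
            wordIndex.insert(std::make_pair(
                w.first,
                std::to_string(docID) + ":" + std::to_string(w.second)));
    }

    // The deferred cost: building the full posting list for a word means
    // walking every duplicate record stored under that key.
    std::vector<std::string> posting_list(const std::string &word)
    {
        std::vector<std::string> list;
        auto range = wordIndex.equal_range(word);
        for (auto it = range.first; it != range.second; ++it)
            list.push_back(it->second);
        return list;
    }

That deferred merge is, of course, exactly why this only helps "for a
while".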
-Geoff