>> Why is this feature absolutely critical for htdig?
>
> We're trying to generate our word database on-the-fly. There are
> essentially two options:
> 1) Use each unique word in the collection as a key and keep the
> document list as the data, including the necessary location information.
> 2) Use each *word* in the collection (possibly paired with a document ID)
> as a key and the location information as the data. This obviously involves
> the DB_DUP flag and storing many duplicate keys.
>
> The problem we're facing is that #1 on a dynamically generated index
> involves expanding variable-length lists and it's painfully slow with the
> current Berkeley DB code.
I can believe that -- the document lists are going to be long,
and probably off-page. Once you move data items off-page in
Berkeley DB, it gets much slower.
> On the other hand, #2 would involve a *lot* of keys, so we care about
> ensuring fast access as well as space required to store all the keys.
I believe that #2 is the right solution. A change we expect to
make this fall is that we're going to convert our off-page
duplicate sets into B+trees of their own. That should guarantee
reasonable access times regardless of the number of duplicates
that you have. (We have customers with millions of duplicates,
which is why we're doing work in this area.)
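
To make the two layouts concrete, here is a minimal sketch in Python. It
does not use Berkeley DB at all: a dict and a sorted list stand in for the
database, and the function names are made up for illustration only.

```python
import bisect
from collections import defaultdict


def add_option1(index, word, doc_id, location):
    """Option 1: one key per unique word; the value is a growing list."""
    # Every new occurrence extends the variable-length record for `word`.
    index[word].append((doc_id, location))


def add_option2(entries, word, doc_id, location):
    """Option 2: one sorted entry per occurrence (duplicate word keys)."""
    # Each occurrence is its own small record; existing records are untouched.
    bisect.insort(entries, (word, doc_id, location))


def lookup_option2(entries, word):
    """Collect all (doc_id, location) pairs stored under `word`."""
    # (word,) sorts before every (word, doc_id, location) tuple.
    start = bisect.bisect_left(entries, (word,))
    hits = []
    for w, doc_id, location in entries[start:]:
        if w != word:
            break
        hits.append((doc_id, location))
    return hits


option1 = defaultdict(list)
option2 = []
for word, doc_id, location in [("index", 2, 7), ("word", 1, 3), ("index", 1, 4)]:
    add_option1(option1, word, doc_id, location)
    add_option2(option2, word, doc_id, location)
```

Both layouts hold the same information. The difference is that option 1
rewrites one ever-longer record per word, while option 2 only adds small
records next to their neighbors in key order.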
> For many reasons, we'd prefer option #1. In particular, it involves fewer
> keys, making it faster to query.
Do you have numbers proving this? Btree fan-out is very aggressive, and
it takes a lot of keys before you should see any difference in the query
speed.
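
As a rough illustration of the fan-out point, the sketch below counts how
many Btree levels a lookup has to walk for a given number of keys. The
fan-out of 100 keys per page is an assumed figure chosen for the example;
the real value depends on page size and key size.

```python
def btree_levels(n_keys, fanout):
    """Smallest number of levels whose capacity covers n_keys.

    Assumes every page holds exactly `fanout` entries, which is an
    idealization; real trees are only partly full.
    """
    levels = 1
    capacity = fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


# Assumed fan-out of 100 (hypothetical, for illustration only).
for n in (10_000, 1_000_000, 100_000_000):
    print(n, btree_levels(n, 100))
```

Under that assumption, going from a million keys to a hundred million
adds a single level to each lookup.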
And, you're going to pay a price to break up the document lists yourself,
which isn't cheap.
And, it's quite reasonable to want to do things like logical joins: join
the document list for one keyword against the document list for another.
If you have your own document list, you'll have to do that yourself; if
you store it as a set of duplicates, Berkeley DB will do it for you.
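
For a sense of what "doing it yourself" means, here is a minimal sketch of
such a join in Python: a merge-style intersection of two sorted document ID
lists. The function name and sample data are invented for the example, and
this is not how Berkeley DB implements its joins internally.

```python
def intersect_sorted(docs_a, docs_b):
    """Return the document IDs present in both sorted lists."""
    i = j = 0
    common = []
    while i < len(docs_a) and j < len(docs_b):
        if docs_a[i] == docs_b[j]:
            common.append(docs_a[i])
            i += 1
            j += 1
        elif docs_a[i] < docs_b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return common


# Hypothetical document lists for two keywords.
docs_for_first_word = [1, 3, 5, 8]
docs_for_second_word = [2, 3, 8, 9]
print(intersect_sorted(docs_for_first_word, docs_for_second_word))  # [3, 8]
```

With a hand-rolled document list, the application also has to unpack each
list into this sorted form before it can run the merge.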
> So we're interested in whatever option
> works. Right now, we're trying #2. Fixing that would require some sort of
> key compression.
Well, I'm still not completely convinced -- if you can solve the problem
with $10 of hardware, I'm not sure how much time it's worth. :-)
Regards,
--keith
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc. [EMAIL PROTECTED]
394 E. Riding Dr. +1-978-287-4781
Carlisle, MA 01741-1601 http://www.sleepycat.com