> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.
>
> E.g., if instead of having a whole bunch of entries like this...
>
>   word // DocID // flags // location -> anchor
>
> what if we had entries like this...
>
>   word // DocID -> flags/location/anchor flags/location/anchor ...

        Keeping the location code in the key eliminates duplicate keys, which
probably helps BDB a little.  The rest (flags and anchor) can go into the
value.
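
        To make the duplicate-key point concrete, here is a rough sketch.
It is not htdig code: the Occurrence struct and both function names are made
up for illustration, and it only counts how many records would share a key
under each layout.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical record, for illustration only (not htdig's real layout).
struct Occurrence {
    std::string word;
    unsigned docId;
    unsigned location;
    unsigned flags;
};

// Key = word // DocID: a word repeated within one document collides.
std::size_t DuplicatesWithoutLocation(const std::vector<Occurrence>& occs) {
    std::set<std::tuple<std::string, unsigned>> keys;
    for (const auto& o : occs) keys.emplace(o.word, o.docId);
    return occs.size() - keys.size();
}

// Key = word // DocID // location: every occurrence gets its own key.
std::size_t DuplicatesWithLocation(const std::vector<Occurrence>& occs) {
    std::set<std::tuple<std::string, unsigned, unsigned>> keys;
    for (const auto& o : occs) keys.emplace(o.word, o.docId, o.location);
    return occs.size() - keys.size();
}
```

        With the location in the key, the second count stays at zero as long
as no word is recorded twice at the same location in the same document.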

> but instead of making database updates each time another word is parsed
> (as is done now in 3.2, if I'm not mistaken), how about if htdig stored
> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.
> They're just written once.

        I implemented this type of caching very quickly using the STL with
slight modifications to the WordDB object.  Mifluz contains a WordDBCache
object, but 3.2 doesn't use it, and it's excessively complicated in
my opinion.

        I'm still evaluating the results of this kind of caching, but at
first glance it seems to help a lot.  I also added a few lines of code to
flush this cache every 250 documents, which makes it even faster.
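
        For anyone who wants to picture the idea before the patch shows up,
a minimal sketch follows.  This is not the actual patch: WordKey, WordValue,
WordCache and the Writer callback are hypothetical names, and the callback
just stands in for whatever WordDB does to put a record into the database.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical key/value layout, not the real htdig/mifluz classes.
struct WordKey {
    std::string word;
    unsigned docId;
    unsigned location;
    bool operator<(const WordKey& o) const {
        return std::tie(word, docId, location) <
               std::tie(o.word, o.docId, o.location);
    }
};

struct WordValue {
    unsigned flags;
    std::string anchor;
};

class WordCache {
public:
    // Stand-in for the real database write (an assumption of this sketch).
    using Writer = std::function<void(const WordKey&, const WordValue&)>;

    WordCache(Writer writer, std::size_t docsPerFlush)
        : writer_(std::move(writer)), docsPerFlush_(docsPerFlush) {}

    // Buffer one occurrence in memory; nothing is written yet.
    void Put(const WordKey& key, const WordValue& value) {
        pending_[key] = value;
    }

    // Call once per fully parsed document; flushes every docsPerFlush docs.
    void DocumentDone() {
        if (++docsSinceFlush_ >= docsPerFlush_) Flush();
    }

    // Write all buffered records in sorted key order, then empty the cache.
    void Flush() {
        for (const auto& kv : pending_) writer_(kv.first, kv.second);
        pending_.clear();
        docsSinceFlush_ = 0;
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    Writer writer_;
    std::size_t docsPerFlush_;
    std::size_t docsSinceFlush_ = 0;
    std::map<WordKey, WordValue> pending_;
};
```

        Each record is written exactly once, and because std::map iterates
in key order the writes arrive sorted.  Whether sorted writes are part of the
speedup I'm seeing is a guess on my part, not something I have measured.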

> I think even optimizations like this become easier if we don't dump out
> any of the db.words.db records until a document is fully parsed, and then
> dump them all out at once.  Am I wrong?  I know that 3.2 is supposed to
> allow indexing a live database on the fly, and still have it be searchable,
> but that doesn't mean the DB needs to be updated a word at a time.  Doing
> it a document at a time should make sense, just as db.docdb is updated.

        I'll try to submit a patch early this week with the caching
added to the WordDB object, flushing the cache every X documents, and
improving the efficiency of the word_db_cmp function.

        See my previous post on word_db_cmp, which I wrote after having a
few beers, so the writing is a bit tangled ;-)

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
