> How much of this database fragmentation would be due to the fact that
> there are records of different lengths, and how much would be due to
> updating a given record from one length to a larger length.
>
> E.g., if instead of having a whole bunch of entries like this...
>
>   word // DocID // flags // location -> anchor
>
> what if we had entries like this...
>
>   word // DocID -> flags/location/anchor flags/location/anchor ...

Keeping the location code in the key eliminates duplicate keys, which
probably helps BDB a little. The rest can go into the value.

> but instead of making database updates each time another word is parsed
> (as is done now in 3.2, if I'm not mistaken), how about if htdig stored
> this information in memory as it did in 3.1, and then just dumped
> out all the records like above after the whole document is parsed.
> That way, none of the records ever have to be updated and lengthened.
> They're just written once.

I implemented this type of caching very quickly using the STL, with slight
modifications to the WordDB object. Mifluz contains a WordDBCache object,
but 3.2 doesn't use it, and in my opinion it's excessively complicated.
I'm still evaluating the results of this kind of caching, but at first
glance it seems to help a lot. I also added a few lines of code to flush
this cache every 250 documents, which makes it even faster.

> I think even optimizations like this become easier if we don't dump out
> any of the db.words.db records until a document is fully parsed, and then
> dump them all out at once. Am I wrong? I know that 3.2 is supposed to
> allow indexing a live database on the fly, and still have it be searchable,
> but that doesn't mean the DB needs to be updated a word at a time. Doing
> it a document at a time should make sense, just as db.docdb is updated.

I'll try to submit a patch early this week that adds the caching to the
WordDB object, flushes the cache every X documents, and improves the
efficiency of the word_db_cmp function. See my previous post on
word_db_cmp, which I wrote after having a few beers, so the writing is a
bit tangled ;-)

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
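
For readers following the thread, the caching idea discussed above can be sketched roughly as follows. This is only an illustration, not the patch mentioned in the message and not the actual WordDB or WordDBCache code. Every name here (`WordKey`, `WordValue`, `FakeStore`, `WordCache`) is hypothetical, and `FakeStore` merely stands in for the Berkeley DB file. The sketch assumes the key layout word // DocID // location with flags and anchor in the value, and it buffers records in a `std::map` until a configurable number of documents has been parsed.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>

// Hypothetical key: (word, DocID, location). Keeping the location in the
// key makes every key unique, so no record is ever rewritten or lengthened.
using WordKey = std::tuple<std::string, unsigned, unsigned>;

// Hypothetical value: whatever is left over once the key is built.
struct WordValue {
    unsigned flags;
    std::string anchor;
};

// Stand-in for the on-disk word database. A real version would issue
// database writes here; this one only counts them.
struct FakeStore {
    std::map<WordKey, WordValue> records;
    std::size_t writes = 0;

    void put(const WordKey& key, const WordValue& value) {
        records[key] = value;
        ++writes;
    }
};

class WordCache {
public:
    WordCache(FakeStore& store, std::size_t flush_every)
        : store_(store), flush_every_(flush_every) {
        assert(flush_every_ > 0);
    }

    // Called for each parsed word occurrence; touches memory only.
    void add(const std::string& word, unsigned doc_id, unsigned location,
             unsigned flags, const std::string& anchor) {
        pending_[WordKey(word, doc_id, location)] = WordValue{flags, anchor};
    }

    // Called once a document has been fully parsed.
    void end_document() {
        if (++docs_since_flush_ >= flush_every_) flush();
    }

    // Write each buffered record once, in sorted key order.
    void flush() {
        for (const auto& entry : pending_) store_.put(entry.first, entry.second);
        pending_.clear();
        docs_since_flush_ = 0;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    FakeStore& store_;
    std::size_t flush_every_;
    std::size_t docs_since_flush_ = 0;
    std::map<WordKey, WordValue> pending_;
};
```

A caller would construct `WordCache cache(store, 250)`, call `add` for every word occurrence, call `end_document` after each document, and call `flush` once more at the end of the run. Because `std::map` iterates in key order, each flush hands the store its records already sorted, which may help a B-tree back end; whether it does in practice would need measuring.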