The old htsearch "troika" of files (or more, if not only using "exact:1"), db.docdb, db.docs.index and db.words.db, has been "downsized" to just two: db.docdb and db.words.db. The rationale is that the extra trip to db.docs.index took time when searching and was not really necessary. The db.docdb file can just as well be indexed by DocID as by URL. I guess the initial choice of index-by-URL was made only because it was the most straightforward way to do it. Since optimizing the behavior of searching should be the primary objective, this was changed.

db.docdb is now indexed by DocID instead of by URL. db.docs.index now holds the opposite translation from before: it translates from URL to DocID, and is used by htdig and htmerge instead of by htmerge and htsearch. db.words.db is as before.

For digging, this means some extra overhead: a translation from URL to DocID has to be performed when checking for the presence of a document. (Where local_urls is not used, the network access normally takes substantially longer than this new overhead.)

When searching, this should generally be a win. It may lose in those cases where most of the documents are "sorted out" by exclude_urls or include_urls (or equivalent), since the bigger database db.docdb must now be accessed to find the URL of a document before throwing it away, rather than the smaller (previous) db.docs.index. Indexing on the DocID also gives the search engine better opportunities to optimize the indexing algorithm for db.docdb, since the index is initially just the record number.

Some small measurements I did confirm the theory. Well, "small" is relative; I had to scale up from a db.docdb of a few MB to one of tens of MB to notice. The search returned about 10K documents. Digging is now about 5-10% slower. Searching is about 30-50% faster (user time and wall-clock time alike; the system time difference was inconclusive). The wall-clock time is the most interesting; the reduced number of disk reads shows up here, not in system or user time.
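
To make the changed lookup path concrete, here is a minimal Python sketch of the two layouts. It is a toy model under stated assumptions, not ht://Dig code: plain dicts stand in for the on-disk databases, every name in it (WORDS, search_old, search_new, is_known_url and so on) is hypothetical, and exclusion is simplified to an exact URL match rather than the pattern matching that exclude_urls performs. It only counts which database is touched per hit.

```python
# Toy model of the two layouts; plain dicts stand in for the on-disk databases.
# All names here are hypothetical and not taken from the ht://Dig sources.

# db.words.db: word -> DocIDs (the same in both layouts).
WORDS = {"search": [1, 2, 3]}

# Old layout: db.docs.index maps DocID -> URL, db.docdb maps URL -> record.
OLD_INDEX = {1: "http://a/", 2: "http://b/", 3: "http://c/"}
OLD_DOCDB = {url: {"url": url, "title": "doc %d" % i}
             for i, url in OLD_INDEX.items()}

# New layout: db.docdb maps DocID -> record, db.docs.index maps URL -> DocID.
NEW_DOCDB = {i: OLD_DOCDB[url] for i, url in OLD_INDEX.items()}
NEW_INDEX = {url: i for i, url in OLD_INDEX.items()}


def search_old(word, excluded=()):
    """Old path: DocID -> URL via the index, then URL -> record via docdb."""
    records = []
    lookups = {"index": 0, "docdb": 0}
    for doc_id in WORDS.get(word, []):
        lookups["index"] += 1
        url = OLD_INDEX[doc_id]
        if url in excluded:
            continue  # rejected after touching only the small index
        lookups["docdb"] += 1
        records.append(OLD_DOCDB[url])
    return records, lookups


def search_new(word, excluded=()):
    """New path: DocID -> record directly; the index is never consulted."""
    records = []
    lookups = {"index": 0, "docdb": 0}
    for doc_id in WORDS.get(word, []):
        lookups["docdb"] += 1
        record = NEW_DOCDB[doc_id]
        if record["url"] in excluded:
            continue  # the big record was fetched only to be discarded
        records.append(record)
    return records, lookups


def is_known_url(url):
    """Digging side: the extra URL -> DocID translation needed when checking
    whether a document is already present."""
    return url in NEW_INDEX
```

With nothing excluded, the old path touches both databases once per hit while the new path touches only db.docdb. With most hits excluded, the old path gets away with mostly index lookups, which is the case where the new layout may lose.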
The decrease in user time is most probably due to handling smaller data. It now also uses 25% less heap (for example 2.8 -> 2MB), which I believe is due to the key field in the match-list, the main memory consumer in htsearch, changing from a URL to a number.

Random fallout:

There's a quirk with the ResultList class: it handles DocIDs as strings, not ints. I did not change this, but performance (speed as well as memory consumption) would probably improve further if it were changed to handle ints.

I removed the extra round-trip to the database in DocumentDB::URLs() (and the new DocIDs()); it used to get the list of keys, then probe the database to see if each key was valid. That seems unnecessary, even more so now, and there was no comment telling me why it had to be done. Surely the keys returned from the DB are valid (and not deleted or some such).

This means you'll have to re-index your databases if you insist on using the bleeding edge, sorry. It also leaves the external database access programs (like whatsnew.pl) high and dry.

More similar changes will come (or so I've heard ;-)

brgds, H-P
