The old htsearch "troika" (or more, if you use more than just
"exact:1") of files, db.docdb, db.docs.index and db.words.db, has
been "downsized" to just two: db.docdb and db.words.db.

The rationale is that the extra trip to db.docs.index took time
when searching and was not really necessary.  The db.docdb file
can just as well be indexed by DocID as by URL.  I guess the
initial choice of index-by-URL was done only because it was the
most straightforward way to do it.

Since optimizing the behavior of searching should be the primary
objective, this was changed.  db.docdb is now indexed by DocID
instead of by URL.  The db.docs.index now performs the opposite
translation from before: it maps from URL to DocID, and is used
by htdig and htmerge instead of by htmerge and htsearch.  The
db.words.db is unchanged.

For digging, this means some extra overhead: a translation from
URL to DocID has to be performed when checking for the presence
of a document.  (Where local_urls is not used, the network access
normally takes substantially longer than this new overhead.)
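
A rough sketch of that presence check, with invented names (this
is not the digger's real code), just to show where the extra hop
sits:

  #include <iostream>
  #include <map>
  #include <string>

  int main()
  {
      std::map<std::string, int> db_docs_index = {   // URL -> DocID
          { "http://example.org/a.html", 1 }
      };
      std::map<int, std::string> db_docdb = {        // DocID -> record
          { 1, "record for a.html" }
      };

      const std::string url = "http://example.org/a.html";

      // The new extra step: translate the URL to a DocID first ...
      auto idx = db_docs_index.find(url);
      // ... then do the usual record lookup, now keyed by DocID.
      bool present = idx != db_docs_index.end() && db_docdb.count(idx->second) != 0;

      std::cout << (present ? "already dug" : "new document") << "\n";
      return 0;
  }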

When searching, this should generally be a win.  It may lose in
those cases where most of the documents are "sorted out" by
exclude_urls or include_urls (or equivalent), since the bigger
database db.docdb must now be accessed to find the URL, rather
than the smaller (previous) db.docs.index, before throwing the
match away.  Indexing on the DocID gives the search engine better
opportunities to optimize the indexing algorithm for db.docdb,
since the index is initially just the record number.
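
For illustration only (again with made-up variable names), here is
the difference in lookups per match that the change buys at search
time:

  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  int main()
  {
      // Old layout: db.docs.index mapped DocID -> URL, db.docdb was keyed by URL.
      std::map<int, std::string>         old_docs_index = { { 1, "http://example.org/a.html" } };
      std::map<std::string, std::string> old_docdb      = { { "http://example.org/a.html", "record A" } };

      // New layout: db.docdb is keyed directly by DocID.
      std::map<int, std::string>         new_docdb      = { { 1, "record A" } };

      std::vector<int> word_hits = { 1 };   // db.words.db already yields DocIDs

      for (int id : word_hits) {
          // Old path: two database accesses per matching document.
          std::cout << "old: " << old_docdb[old_docs_index[id]] << "\n";

          // New path: one access, keyed by the DocID itself.
          std::cout << "new: " << new_docdb[id] << "\n";
      }
      return 0;
  }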

Some small measurements I made verify the theory.  Well, "small"
is relative; I had to scale from a db.docdb of a few MB up to
tens of MB to notice.  The search returned about 10K documents.
Digging is now about 5-10% slower.  Searching is about 30-50%
faster (user time and wall-clock time alike; the system-time
difference was inconclusive).  The wall-clock time is the most
interesting; the reduced disk reads show up there, not in system
or user time.  The decrease in user time is most probably due to
handling smaller data.

It now also uses about 25% less heap (for example 2.8 MB -> 2 MB),
which I believe is due to the key field in the match list
shrinking from a URL to a number; the match list is the main
memory consumer in htsearch.
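
As a rough illustration of where those savings plausibly come from
(the MatchBy* types below are invented, not the actual match-list
structures): a numeric key is a handful of bytes, while a URL key
drags a whole heap-allocated string along for every entry.

  #include <cstdio>
  #include <string>

  struct MatchByURL { std::string key; float score; };   // old-style key: the URL
  struct MatchByID  { int         key; float score; };   // new-style key: the DocID

  int main()
  {
      // The URL variant also pays for the URL's characters on the heap,
      // on top of the std::string object itself.
      std::printf("per-entry size, URL key:   %zu bytes (+ URL text)\n", sizeof(MatchByURL));
      std::printf("per-entry size, DocID key: %zu bytes\n", sizeof(MatchByID));
      return 0;
  }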


Random fallout:
There's a quirk with the ResultList class; it handles DocIDs as
strings, not ints.  I did not change this, but performance (speed
as well as memory consumption) would probably be further improved
if it were changed to handle ints.
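
A tiny sketch of the difference (with stand-in types, since I have
not touched the real ResultList): keying on strings means
converting every DocID to text and comparing strings on every
lookup, where an int key would do.

  #include <iostream>
  #include <map>
  #include <string>

  int main()
  {
      std::map<std::string, float> result_by_string;   // current quirk: DocID "42" as a string key
      std::map<int, float>         result_by_int;      // suggested: DocID 42 as a plain int key

      int doc_id = 42;
      result_by_string[std::to_string(doc_id)] = 0.9f; // allocate a string, compare strings
      result_by_int[doc_id] = 0.9f;                    // compare two ints

      std::cout << result_by_string.count("42") << " "
                << result_by_int.count(42) << "\n";
      return 0;
  }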

I removed the extra round-trip to the database for
DocumentDB::URLs() (and the new DocIDs()); it used to get the
list of keys and then probe the database once per key to see if
the key was valid.  That seems unnecessary, even more so now, and
there was no comment telling me why it had to be done.  Surely
the keys returned from the DB are valid (and not deleted or some
such).
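
In sketch form (the helper names here are mine, not the real
DocumentDB interface), the change amounts to dropping the per-key
probe:

  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  using Database = std::map<int, std::string>;     // stand-in for db.docdb: DocID -> record

  // Old style: walk the keys, then probe the database once more per key.
  std::vector<int> DocIDsWithProbe(const Database &db)
  {
      std::vector<int> ids;
      for (const auto &entry : db)
          if (db.count(entry.first) != 0)          // the redundant round-trip
              ids.push_back(entry.first);
      return ids;
  }

  // New style: trust the keys the database hands back.
  std::vector<int> DocIDs(const Database &db)
  {
      std::vector<int> ids;
      for (const auto &entry : db)
          ids.push_back(entry.first);
      return ids;
  }

  int main()
  {
      Database db = { { 1, "record A" }, { 2, "record B" } };
      std::cout << DocIDsWithProbe(db).size() << " " << DocIDs(db).size() << "\n";
      return 0;
  }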

This means you'll have to re-index your databases if you insist
on using the bleeding edge, sorry.
And this leaves the external database access programs (like
whatsnew.pl) high and dry.  More similar changes will come
(or so I've heard ;-)

brgds, H-P
