Hello Torsten,

> As of 3.1.2 there was already a patch solution for this which has been
> incorporated into 3.1.4 and which is much cleaner than just renaming
> REQUEST_METHOD.  In other words, you applied a patch for something the
> search engine is already able to do ;-)

Grin! How do I utilize this feature? When I passed the query_string as an
argument to htsearch, it ignored the argument: it saw the REQUEST_METHOD
environment variable set for the encapsulating CGI and read the actual
QUERY_STRING variable instead.
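
If 3.1.4 behaves the way I'm guessing, a wrapper along these lines ought to
do it (an untested Python sketch; the htsearch path, the config path, and the
idea that htsearch falls back to the argv query string when REQUEST_METHOD
is unset are all my assumptions):

    import os
    import subprocess

    def run_htsearch(words, config="/opt/www/conf/htdig.conf"):
        # Drop the CGI variables inherited from the encapsulating script so
        # htsearch cannot pick them up (assumption: without REQUEST_METHOD
        # it takes the query string from argv instead).
        env = os.environ.copy()
        env.pop("REQUEST_METHOD", None)
        env.pop("QUERY_STRING", None)
        result = subprocess.run(
            ["/usr/local/bin/htsearch", "-c", config, "words=" + words],
            env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout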

> I'd rather work around that by trapping the site response codes prior to
> indexing the sites using a tool like DLC (dead-link-check).

This sounds like a good approach. One complicating factor I observed is that
the set of temporarily "dead" servers varied from test run to test run.
Pre-checking several thousand servers (eventually, tens of thousands) would
take long enough for a fresh batch of dead servers to crop up before the
index run even started. I'm amazed at the number of overburdened web servers
out there. I'm also sensitive to bandwidth issues and want to keep accesses
to a minimum (I've even removed the robots.txt retrieval logic because I'm
not actually using htdig to spider pages).
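
For what it's worth, the kind of pre-check I'd run looks roughly like this
(a minimal sketch, not DLC itself; the hosts.txt input file and the 5-second
timeout are made up):

    import socket
    from concurrent.futures import ThreadPoolExecutor

    def is_alive(host, port=80, timeout=5.0):
        # A plain TCP connect is enough to weed out dead or unreachable
        # servers without fetching any pages, which keeps the bandwidth
        # cost tiny.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def find_dead(hosts, workers=50):
        # Check many servers in parallel so one slow host doesn't stall
        # the whole sweep.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(is_alive, hosts))
        return [h for h, ok in zip(hosts, results) if not ok]

    if __name__ == "__main__":
        with open("hosts.txt") as f:
            hosts = [line.strip() for line in f if line.strip()]
        for host in find_dead(hosts):
            print(host)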

> If you have plenty of disk space, I'd even have a single small database
> for every site being indexed (and have them merged after the index run),
> in which case you can run multiple instances of the indexer concurrently
> (you can then have a merger process waiting for new input to be merged
> into the new search database).  That should further increase the speed
> of the indexer process.

Another good approach. However, unless I'm misunderstanding your suggestion,
Linux hates having thousands of files in a single directory (ext2 does a
linear scan on directory lookups), so file I/O performance is severely
penalized in that case. It's not a serious problem, but you'd have to build
extra management logic to spread the files across many directories (e.g.,
/dbs/o/on/onesite.db, /dbs/t/tw/twosite.db, and so on). I think I'll just
wait until htdig is multi-threaded ;).
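
For the record, the spreading logic I mean would be something like this (a
sketch only; the /dbs root and the two-level prefix fan-out are just my
example above, not anything htdig provides):

    import os

    def db_path(site, root="/dbs"):
        # Bucket each per-site database under two levels of prefix
        # directories so no single directory accumulates thousands of files.
        name = site.lower()
        return os.path.join(root, name[:1], name[:2], name + ".db")

    def ensure_db_dir(site, root="/dbs"):
        # Create the prefix directories on demand before the indexer writes.
        path = db_path(site, root)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        return path

    print(db_path("onesite"))   # /dbs/o/on/onesite.db
    print(db_path("twosite"))   # /dbs/t/tw/twosite.db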

> Regarding the time-out settings, I think that this heavily depends upon
> the production system used.  If you have good routes to every site, you
> will probably be fine with it.  If not, it might cause some trouble.

Thanks! This is encouraging. Luckily, routes don't seem to be a serious
issue, as the machine sits at a server farm with eight backbone connections
to the Internet. I'll keep an eye on it, though.

All the best,

Sean.

# Digital Spinner, Inc.
# Web Design, Development and Consulting.
# Phone: 802.948.2020
# Fax: 802.948.2749
# http://www.digitalspinner.com

