Jim said:

> Did you try setting ignore_dead_servers to false as I suggested earlier?

Yes, I added that and it did help a bit. The next crawl conked out after
about 33,000 pages, so at least we doubled the number.

I'm trying to weed out a few more duplicates to reduce the crawl size and
see if that helps. I've added more directories to the exclude list.
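For the record, the additions look roughly like this in my htdig.conf (a
sketch of my own file, with placeholder directory names; exclude_urls is
the attribute I've been adding to, and the backslashes are just line
continuations):

    exclude_urls: /cgi-bin/ .cgi \
                  /archive/ \
                  /printable/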

Does the exclude list (exclude_urls) allow regex?

The reason I ask is that, for example's sake, a story about the new Htdig
movie may appear at:

/news/main/2004/08/30/htdig-casting
/news/hughgrant/2004/08/30/htdig-casting
/news/angelinajolie/2004/08/30/htdig-casting
/news/melgibson/2004/08/30/htdig-casting

Needless to say, I don't need to index that story four times. But since
the list of people is continuously growing, I can't keep heading back into
the conf file to add each and every /news/celebname by hand.
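What I'm hoping for is a single pattern that covers any name under /news/,
something like this (purely wishful, I haven't tested it, and I'm only
guessing at a regex-ish syntax):

    # hypothetical: one pattern instead of a line per name
    exclude_urls: /news/[^/]+/

...with /news/main/ somehow carved out as an exception so the story still
gets indexed once.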

Thanks again.


