Kir Kolyshkin
Mon, 23 Sep 2002 00:36:59 -0700
The best solution is to run many threads (say, -N 50 is not that bad). If one site is unavailable, one thread will be stuck trying to reach it while the other 49 threads keep running fine, so indexing speed drops by only about 2%, which seems OK to me.
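The 2% figure is simple arithmetic, and a minimal sketch may make it easier to check for other thread counts. It assumes that at most one thread at a time is blocked on the dead host and that all threads otherwise contribute equally. The function name `throughput_loss` is hypothetical and is not part of the indexer.

```python
def throughput_loss(total_threads: int, blocked_threads: int = 1) -> float:
    """Fraction of indexing capacity lost while some threads wait on a dead host.

    Simplified model (an assumption, not measured behaviour): every thread
    contributes equally, and a blocked thread contributes nothing until its
    connection attempt times out.
    """
    if total_threads <= 0:
        raise ValueError("total_threads must be positive")
    if not 0 <= blocked_threads <= total_threads:
        raise ValueError("blocked_threads must be between 0 and total_threads")
    return blocked_threads / total_threads


if __name__ == "__main__":
    # Compare a single-threaded run with larger thread counts.
    for n in (1, 10, 50):
        print(f"-N {n}: {throughput_loss(n):.0%} of capacity lost")
```

With a single thread the whole run stalls for the duration of each timeout, while with 50 threads the same dead host costs one fiftieth of the capacity under this model.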
J and T wrote:
> Today I sent a crawl of about 200,000 URLs. One of the sites, containing
> about 2,000 URLs, is no longer an active site. They closed their doors.
> When indexer is running it responds with "Can't connect to host". It
> seems DNS records are still active (never removed) for the domain, but
> the site is not operational. The problem is that index still tries to
> connect to this host for every single page in the index. Because we
> don't time out for about 90 seconds, index pretty much hangs forever.
> Sure, if I monitored index 24/7 I guess I could halt its operation,
> do an ./index -C "http://sitename%" and then start the process all over,
> but I'm not always sitting there.
>
> The best solution would be if index had the ability to mark all URLs as
> status 500 so the indexer wouldn't hang on all future URLs requested for
> this domain. It would certainly improve performance.
>
> Just a suggestion,
> John

-- 
[EMAIL PROTECTED] ICQ7551596 [EMAIL PROTECTED]
Guinness a Day Keeps a Doctor Away (people's wisdom)