While testing out Nutch, I've discovered several issues with hangs inside specific parsers, and realized that the Fetcher code has no concept of a timeout on a thread. From experience doing whole-web crawls, I've found this to be an essential feature for long-term stability (read: hands-off production crawling of large indices).

As I'm new to this codebase: does the idea of a fetch thread timeout (not just an HTTP timeout) for a bad parser already exist? If so, how would I set it? If not (and looking at the code, I believe this to be the case), is there any issue with adding it?

I saw a mention of this from Doug Cutting on nutch-general on Oct 29th, 2005:

"Also, the mapred fetcher has been changed to succeed even when threads hang. Perhaps we should change the 0.7 fetcher similarly? I think we should probably go even farther, and kill threads which take longer than a timeout to process a url. Thread.stop() is theoretically unsafe, but I've used it in the past for this sort of thing and never traced subsequent problems back to it... "

I would agree with Doug that Thread.stop() is "unsafe", but I have used it on large sites as well. At the very least, restarting the fetcher (can this be done?) after that point would help get through the list.
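
For what it's worth, here is a minimal sketch of the kind of per-URL watchdog I have in mind. This is not Nutch's actual Fetcher code; the class name, the URL_TIMEOUT_MS value, and the beginUrl/endUrl hooks are all hypothetical, just to show the idea of interrupting (or, as a last resort, stopping) a thread that has spent too long on one URL:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetchWatchdog extends Thread {
    private static final long URL_TIMEOUT_MS = 5 * 60 * 1000; // assumed 5-minute limit per URL

    // Maps each fetcher thread to the time it began its current URL.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();

    // A worker would call this just before handing a page to a parser.
    public void beginUrl(Thread worker) {
        startTimes.put(worker, System.currentTimeMillis());
    }

    // A worker would call this once the URL has been fully processed.
    public void endUrl(Thread worker) {
        startTimes.remove(worker);
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
                if (now - e.getValue() > URL_TIMEOUT_MS) {
                    Thread hung = e.getKey();
                    hung.interrupt();   // polite attempt first
                    // hung.stop();     // last resort; unsafe, as Doug notes
                    startTimes.remove(hung);
                }
            }
            try {
                Thread.sleep(10_000);   // check every 10 seconds
            } catch (InterruptedException ie) {
                return;
            }
        }
    }
}
```

Interrupting first only helps if the parser checks its interrupt status; for a truly wedged parser, the commented-out stop() (or abandoning the thread and starting a replacement, as the mapred fetcher apparently does) would be the fallback.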

Jonathan Reichhold
