While testing out Nutch, I've discovered several issues with hangs inside specific parsers, and realized that the Fetcher code has no concept of a timeout on a thread. From experience doing whole-web crawls, I've found this to be an essential feature for long-term stability (read: hands-off production crawling of large indices).

As I'm new to this codebase: does the idea of a fetch thread timeout (not just an HTTP timeout) for a bad parser already exist? If so, how would I set it? If not (and looking at the code, I believe this to be the case), is there any issue with adding it?

I saw a mention of this from Doug Cutting on nutch-general on Oct 29th, 2005:

"Also, the mapred fetcher has been changed to succeed even when threads hang. Perhaps we should change the 0.7 fetcher similarly? I think we should probably go even farther, and kill threads which take longer than a timeout to process a url. Thread.stop() is theoretically unsafe, but I've used it in the past for this sort of thing and never traced subsequent problems back to it... "

I would agree with Doug that Thread.stop() is "unsafe", but I have used it on large sites as well. At the very least, restarting the fetcher (can this be done?) after that point would help get through the list.
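
For what it's worth, here is a minimal sketch of the kind of per-URL watchdog I have in mind. This is not Nutch's actual Fetcher code; the class name, the URL_TIMEOUT_MS value, and the beginUrl/endUrl hooks are all hypothetical, just to show the idea of interrupting (or, as a last resort, stopping) a thread that has spent too long on one URL:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FetchWatchdog extends Thread {
    private static final long URL_TIMEOUT_MS = 5 * 60 * 1000; // assumed 5-minute limit per URL

    // Maps each fetcher thread to the time it began its current URL.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();

    // A worker would call this just before handing a page to a parser.
    public void beginUrl(Thread worker) {
        startTimes.put(worker, System.currentTimeMillis());
    }

    // A worker would call this once the URL has been fully processed.
    public void endUrl(Thread worker) {
        startTimes.remove(worker);
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            long now = System.currentTimeMillis();
            for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
                if (now - e.getValue() > URL_TIMEOUT_MS) {
                    Thread hung = e.getKey();
                    hung.interrupt();   // polite attempt first
                    // hung.stop();     // last resort; unsafe, as Doug notes
                    startTimes.remove(hung);
                }
            }
            try {
                Thread.sleep(10_000);   // check every 10 seconds
            } catch (InterruptedException ie) {
                return;
            }
        }
    }
}
```

Interrupting first only helps if the parser checks its interrupt status; for a truly wedged parser, the commented-out stop() (or abandoning the thread and starting a replacement, as the mapred fetcher apparently does) would be the fallback.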

Jonathan Reichhold
