All,
I'm having some trouble with the Nutch nightly. It has been a while
since I last updated my crawl of our intranet. When I tried to run the
crawl today, it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
In the web interface it says that:
Task task_200711261211_0026_m_000015_0 failed to report status for 602
seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601
seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601
seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602
seconds. Killing!
I don't have the fetchers set to parse. Nutch and Hadoop are running
on a 3-node cluster. I've attached the job configuration file as saved
from the web interface.
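
As a stopgap I've been considering raising the task timeout so the
long-running parse isn't killed at the 600-second mark, with something
like this in hadoop-site.xml (assuming mapred.task.timeout, in
milliseconds, is still the right property in the nightly):

  <property>
    <name>mapred.task.timeout</name>
    <value>1800000</value>
    <description>Kill a task if it neither reads input, writes output,
    nor reports progress for this many milliseconds (30 minutes here,
    up from the 10 minute default).</description>
  </property>

But that would only hide the hang, so I'd still like to find the
document that is actually causing it.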
Is there any way I can get more information on which file or URL the
parse is failing on? Why doesn't the parsing of a file or URL fail
more cleanly?
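
So far the only thing I've thought of is turning up logging for the
parse code in conf/log4j.properties, along these lines (assuming the
parse classes and plugins log through the usual package-name loggers):

  # extra detail from the parsing code and parse plugins
  log4j.logger.org.apache.nutch.parse=DEBUG
  log4j.logger.org.apache.nutch.plugin=DEBUG

but I'm not sure that will point at the exact URL if the parser simply
hangs rather than throwing.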
Any recommendations on helping Nutch avoid whatever is causing the
hang so it can index the rest of the content?
Thanks.
Jeff Bolle