All,
I'm having some trouble with the Nutch nightly.  It has been a while
since I last updated my crawl of our intranet.  I was attempting to run
the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)

In the web interface it says:
Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
 
I don't have the fetchers set to parse.  Nutch and Hadoop are running
on a three-node cluster.  I've attached the job configuration file as
saved from the web interface.
 
Is there any way I can get more information on which file or URL the
parse is failing on?  Why doesn't the parsing of a single file or URL
fail more cleanly?
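 
My current plan, unless someone has a better approach, is to re-run
the parse step by hand on the failing segment with parse logging
turned up, roughly like this (the segment path is just a placeholder):

  bin/nutch parse crawl/segments/<segment_dir>

with something like this added to conf/log4j.properties:

  log4j.logger.org.apache.nutch.parse=DEBUG

Hopefully the last URL logged before the hang would point at the
culprit.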
 
Any recommendations for helping Nutch avoid whatever is causing the
hang so it can index the rest of the content?
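 
One guess: a huge or malformed document could be wedging one of the
parser plugins.  I'll double-check that I haven't set
http.content.limit to -1 (unlimited) in nutch-site.xml; keeping it
bounded should at least limit how much any one document can tie up a
parser, e.g.:

  <property>
    <name>http.content.limit</name>
    <!-- maximum bytes fetched per document; 65536 is the shipped default -->
    <value>65536</value>
  </property>

That wouldn't fix a parser bug, though, so any other ideas are
welcome.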
 
Thanks.
 
 
Jeff Bolle
 
