Hi,
I ran into a problem with the Nutch trunk version.
It seems to be related to the move to Hadoop-0.4.0 and JDK 1.5
(more precisely, it appeared after HADOOP-129 and the replacement of File by Path).
In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpath/crawl/crawldb/current in local is invalid.
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Looking at the Nutch code, simply changing line 145 of Injector to
mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))
makes everything work. Taking a closer look at the CrawlDb code, I still don't
understand why the createJob method contains the following line:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
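To illustrate what I think is happening, here is a minimal sketch. It does not use the real Hadoop or Nutch classes: MiniJobConf, firstInvalidInput and the /tmp/inject-temp path are made up for illustration. I am assuming that addInputPath appends to the inputs already configured while setInputPath replaces them, and that crawldb/current does not exist yet on the first inject.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InputPathSketch {

    // Hypothetical stand-in for a job configuration; NOT Hadoop's JobConf.
    static class MiniJobConf {
        private final List<String> inputPaths = new ArrayList<String>();

        // Assumed semantics: append to the inputs already configured.
        void addInputPath(String dir) {
            inputPaths.add(dir);
        }

        // Assumed semantics: drop earlier inputs and keep only this one.
        void setInputPath(String dir) {
            inputPaths.clear();
            inputPaths.add(dir);
        }

        List<String> getInputPaths() {
            return inputPaths;
        }
    }

    // Mirrors the createJob line quoted above: the job starts with
    // <crawlDb>/current already registered as an input directory.
    static MiniJobConf createJob(String crawlDb) {
        MiniJobConf job = new MiniJobConf();
        job.addInputPath(crawlDb + "/current");
        return job;
    }

    // Mimics a submit-time check: returns the first input directory that
    // does not exist, or null when every input is valid.
    static String firstInvalidInput(MiniJobConf job, Set<String> existingDirs) {
        for (String dir : job.getInputPaths()) {
            if (!existingDirs.contains(dir)) {
                return dir;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String crawlDb = "/localpath/crawl/crawldb";
        String tempDir = "/tmp/inject-temp"; // hypothetical temp dir
        // First inject: only the temp dir exists, crawldb/current does not.
        Set<String> existingDirs = new HashSet<String>(Arrays.asList(tempDir));

        MiniJobConf withAdd = createJob(crawlDb);
        withAdd.addInputPath(tempDir); // keeps crawldb/current as an input
        System.out.println("addInputPath: first invalid input = "
                + firstInvalidInput(withAdd, existingDirs));

        MiniJobConf withSet = createJob(crawlDb);
        withSet.setInputPath(tempDir); // replaces crawldb/current
        System.out.println("setInputPath: first invalid input = "
                + firstInvalidInput(withSet, existingDirs));
    }
}
```

Under these assumptions, the addInputPath variant still carries the missing crawldb/current directory as an input, while the setInputPath variant only has the temp dir. That would explain why my change avoids the error, but not whether it is the right fix once crawldb/current exists.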
Out of curiosity, could a Hadoop guru explain why there is such a
regression?
Has anybody else seen the same error?
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/