Hi Jérôme,
I have the same problem in a distributed environment! :-(
So I think I can confirm this is a bug.
We should fix that.
Stefan
On 06.07.2006, at 08:54, Jérôme Charron wrote:
Hi,
I encountered some problems with the Nutch trunk version.
They seem to be related to the changes for Hadoop-0.4.0 and JDK 1.5
(more precisely, since HADOOP-129 and the replacement of File by Path).
In my environment, the crawl command terminates with the following error:
2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
Exception in thread "main" java.io.IOException: Input directory /localpathcrawl/crawldb/current in local is invalid.
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Looking at the Nutch code, simply changing line 145 of Injector to
mergeJob.setInputPath(tempDir) (instead of mergeJob.addInputPath(tempDir))
makes everything work. Taking a closer look at the CrawlDb code, I still
don't understand why the createJob method contains the following line:
job.addInputPath(new Path(crawlDb, CrawlDatum.DB_DIR_NAME));
Out of curiosity, could a Hadoop guru explain why there is such a
regression?
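To make the difference between the two calls concrete, here is a minimal toy model in plain Java. It is only a sketch and not Hadoop's code: ToyJob, firstInvalidInput and the /tmp/inject-temp path are hypothetical names invented for illustration. It also assumes that job submission rejects any input directory that does not exist, and that crawldb/current is missing on a fresh crawl; neither assumption is confirmed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InputPathSketch {

    /** Hypothetical stand-in for a job configuration; NOT Hadoop's JobConf. */
    public static class ToyJob {
        private final List<String> inputPaths = new ArrayList<String>();

        /** Appends a directory to the inputs already configured. */
        public void addInputPath(String dir) {
            inputPaths.add(dir);
        }

        /** Discards all earlier inputs and keeps only this directory. */
        public void setInputPath(String dir) {
            inputPaths.clear();
            inputPaths.add(dir);
        }

        public List<String> getInputPaths() {
            return new ArrayList<String>(inputPaths);
        }
    }

    /** Mimics the createJob line quoted above: registers <crawlDb>/current as an input. */
    public static ToyJob createJob(String crawlDb) {
        ToyJob job = new ToyJob();
        job.addInputPath(crawlDb + "/current");
        return job;
    }

    /** Assumed validation: returns the first input that does not exist, or null if all exist. */
    public static String firstInvalidInput(ToyJob job, Set<String> existingDirs) {
        for (String dir : job.getInputPaths()) {
            if (!existingDirs.contains(dir)) {
                return dir;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: on a fresh crawl only the temporary inject output exists.
        Set<String> existing = new HashSet<String>(Arrays.asList("/tmp/inject-temp"));

        ToyJob added = createJob("/localpath/crawl/crawldb");
        added.addInputPath("/tmp/inject-temp");   // keeps crawldb/current as an input
        System.out.println("addInputPath, invalid input: " + firstInvalidInput(added, existing));

        ToyJob replaced = createJob("/localpath/crawl/crawldb");
        replaced.setInputPath("/tmp/inject-temp"); // drops crawldb/current from the inputs
        System.out.println("setInputPath, invalid input: " + firstInvalidInput(replaced, existing));
    }
}
```

In this model, setInputPath avoids the error only because it removes crawldb/current from the inputs. It would remove it just the same when the directory does exist, so the one-line change may hide the problem rather than fix it.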
Does anybody else have the same error?
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/