Doug Cutting wrote:
Jérôme Charron wrote:
> In my environment, the crawl command terminates with the following error:
>
>   2006-07-06 17:41:49,735 ERROR mapred.JobClient (JobClient.java:submitJob(273)) - Input directory /localpath/crawl/crawldb/current in local is invalid.
>   Exception in thread "main" java.io.IOException: Input directory /localpath/crawl/crawldb/current in local is invalid.
>       at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>       at org.apache.nutch.crawl.Injector.inject(Injector.java:146)
>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
Hadoop 0.4.0 by default requires all input directories to exist,
whereas previous releases did not. So we need to either create an
empty "current" directory or change the InputFormat used in
CrawlDb.createJob() to be one that overrides
InputFormat.areValidInputDirectories(). The former is probably
easier. I've attached a patch. Does this fix things for folks?
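
A minimal sketch of the simpler option Doug describes (making sure the crawldb's
empty "current" directory exists before the job is submitted) is below. This is
not the attached patch: it is written against the Hadoop FileSystem/Path API,
and the class and helper name (CrawlDbCurrentDirFix, ensureCurrentDir) are made
up for illustration, so exact types and signatures in the 0.4.x-era codebase may
differ.

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  public class CrawlDbCurrentDirFix {

    // Create crawlDb/current if it is missing, so the input-directory
    // validation in JobClient.submitJob() accepts it. (Hypothetical helper,
    // not the code from the actual patch.)
    public static void ensureCurrentDir(JobConf job, Path crawlDb) throws IOException {
      Path current = new Path(crawlDb, "current");  // the crawldb's "current" subdirectory
      FileSystem fs = FileSystem.get(job);
      if (!fs.exists(current)) {
        fs.mkdirs(current);                         // an empty directory satisfies the check
      }
    }
  }

Calling something like ensureCurrentDir(job, crawlDb) from CrawlDb.createJob()
(or just before the job is run in Injector.inject()) keeps the fix in one place,
which is presumably why Doug considers it easier than swapping in an InputFormat
that overrides areValidInputDirectories().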
Patch works for me.
--
Sami Siren