If you download the latest trunk copy of 0.8, bin/nutch is not even
available. Is it supposed to be this way?
Matt
Bryan Woliner wrote:
I am certainly far from a Nutch expert, but it appears to me that there
are two errors in the current Nutch 0.8 tutorial.
First off, here is the version of Nutch 0.8 that I am using, in case
changes have been made in a newer version that invalidate my comments:
-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)
Error #1:
Towards the end of the tutorial, the following command is found:
bin/nutch invertlinks crawl/linkdb crawl/segments
When I call this command verbatim, I get the following error:
2006-07-25 08:44:40,503 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(119)) - job_8ly5hf
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml , mapred-default.xml , /home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xmlfinal: hadoop-site.xml
    at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:96)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:37)
    at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:106)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)
I think the correct syntax for the command should be:
bin/nutch invertlinks crawl/linkdb crawl/segments/*
(with the /* added to the end).
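To illustrate the difference between the two forms: the only change is shell glob expansion. The tutorial's form passes the parent directory as a single argument, while the /* form passes each segment directory as its own argument. Below is a minimal sketch that prints both command lines without running Nutch. The scratch directory and the segment names are made up for the example.

```shell
# Build a scratch layout that mimics crawl/segments.
# The two segment names are hypothetical timestamps, not real crawl output.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/crawl/segments/20060725084000" \
         "$demo_dir/crawl/segments/20060725091500"
cd "$demo_dir" || exit 1

# echo only shows the arguments the shell would hand to bin/nutch;
# nothing is actually executed.
tutorial_cmd=$(echo bin/nutch invertlinks crawl/linkdb crawl/segments)
fixed_cmd=$(echo bin/nutch invertlinks crawl/linkdb crawl/segments/*)

echo "$tutorial_cmd"
echo "$fixed_cmd"
```

The second line printed lists every directory under crawl/segments individually, which is the form that worked in the report above.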
Error #2:
The tutorial says that to index, the following command should be called:
bin/nutch index indexes crawl/linkdb crawl/segments/*
However, when I call that command I get the following error:
Usage: <index> <crawldb> <linkdb> <segment> ...
I believe the correct syntax should be:
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
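As a rough illustration of the argument order in the usage message, here is a small sketch that assembles the index command only when at least four arguments are given. build_index_cmd is a hypothetical helper, not part of Nutch, and the argument-count check is an assumption about why the tutorial's form was rejected (with a single segment it supplies only three arguments). The segment name is made up.

```shell
# Hypothetical helper (not part of Nutch): build the index command in the
# order given by the usage message:
#   <index> <crawldb> <linkdb> <segment> ...
build_index_cmd() {
    if [ "$#" -lt 4 ]; then
        # Assumed check: too few arguments means something is missing.
        echo "need: <index> <crawldb> <linkdb> <segment> ..." >&2
        return 1
    fi
    echo "bin/nutch index $*"
}

# Tutorial form with one segment: three arguments, no crawldb.
build_index_cmd indexes crawl/linkdb crawl/segments/20060725084000 \
    || echo "rejected: crawldb argument is missing"

# Corrected form: index dir, crawldb, linkdb, then the segment(s).
build_index_cmd crawl/indexes crawl/crawldb crawl/linkdb \
    crawl/segments/20060725084000
```

The helper only prints the command line, so it can be tried without a Nutch checkout.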
If these are indeed errors in the tutorial, perhaps someone with the
authority to do so would be kind enough to make the necessary
changes.
My two cents,
Bryan