Lukas,
the input folders are normally set by the tools, so you cannot
change that.
However, if you are on a Unix box, check that the user that runs
Nutch has read and write access to all the folders defined in
nutch-site.xml and nutch-default.xml.
(I guess that may be the problem; Nutch uses e.g. /tmp to write
some data.)
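
A minimal sketch of such a permission check, assuming Python is available on the box. The function name check_dirs and the example path list are hypothetical; put in whatever folders your own config files actually point to, and run it as the same user that runs Nutch.

```python
import os


def check_dirs(paths):
    """Return {path: problem} for each path the current user cannot use."""
    problems = {}
    for path in paths:
        if not os.path.isdir(path):
            problems[path] = "missing or not a directory"
        # R_OK/W_OK: read and write; X_OK: needed to enter a directory.
        elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
            problems[path] = "no read/write access"
    return problems


if __name__ == "__main__":
    # Hypothetical list: replace with the folders named in your config.
    for path, problem in check_dirs(["/tmp"]).items():
        print(path, "->", problem)
```

An empty result only means the basic access checks passed for the listed folders; it does not prove that every folder Nutch uses is covered.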
If this does not solve the problem, just run the commands manually,
one at a time; there is a tutorial in the wiki on how to run the
mapred commands step by step.
Stefan
On 21.12.2005 at 06:56, Lukas Vlcek wrote:
Hi,
I am trying to use nutch-0.8-dev and I have a problem with a crawl run.
I checked out from SVN and built a fresh package (ant package - all
went fine). Then I installed Nutch on Linux and made only minor
changes to the nutch-site.xml file (turned on some plugins and increased
several constants), prepared a file with urls and started bin/nutch
crawl.
This worked for nutch-0.7x, but with nutch-0.8-dev I am getting the
following exception in the log file:
051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml , nutch-site.xml
    at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
    at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
    at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)
It seems that the problem is that Nutch is not able to find the
mapred.input.subdir setting in any of the config files. I found that
there is a mapred.input.dir property defined in the config for the
particular job (job_4zwds6.xml) with a value equal to the name of my
urls file, but I don't understand where I should define the
mapred.input.subdir property and what value to assign to it (if it needs
to be defined manually - note that mapred.input.dir seems to be
configured automatically).
Does anybody know the answer?
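
For reference, a small sketch of how the mapred.* properties in one of these config files could be listed, assuming the file uses the usual property/name/value elements. The function name read_properties is hypothetical, and the script only reads the XML; it does not show how Nutch itself resolves settings across files.

```python
import sys
import xml.etree.ElementTree as ET


def read_properties(xml_path, prefix=""):
    """Return {name: value} for <property> entries whose name starts with prefix."""
    props = {}
    # iter() finds <property> elements whatever the root element is called.
    for prop in ET.parse(xml_path).getroot().iter("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="") or ""
        if name and name.startswith(prefix):
            props[name] = value.strip()
    return props


if __name__ == "__main__":
    # Pass the config files to inspect on the command line,
    # e.g. the job_4zwds6.xml file named in the log above.
    for path in sys.argv[1:]:
        for name, value in sorted(read_properties(path, "mapred.").items()):
            print(path, name, "=", value)
```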
P.S.: Note that the line numbers in the exception trace above for the
InputFormatBase.java file (85, 95) may be slightly off: I inserted
some extra LOG.debug() calls there while searching for the root
cause and then removed them again, but it is possible that I left
some extra empty lines behind.
Thanks,
Lukas