Lukas,
The input folder is normally set by the tools, so you cannot change it. However, if you are on a Unix box, check that the user that runs Nutch has read and write access to all the folders defined in nutch-site.xml and nutch-default.xml. (I guess that could be the problem; Nutch uses e.g. /tmp to write some data.) If this does not solve the problem, just run the commands manually step by step; there is a tutorial in the wiki on how to run the mapred commands step by step.
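
To make the permission check concrete, here is a minimal sketch in Python. It is not part of Nutch; the function name check_access and the list of directories are only illustrative assumptions, so replace them with the folders actually named in your own configuration. Run it as the same user that runs Nutch, since the result depends on who executes it.

```python
import os
import tempfile


def check_access(paths):
    """Return a dict mapping each path to a list of problems (empty list = OK)."""
    report = {}
    for path in paths:
        problems = []
        if not os.path.isdir(path):
            problems.append("not a directory")
        else:
            if not os.access(path, os.R_OK):
                problems.append("not readable")
            if not os.access(path, os.W_OK):
                problems.append("not writable")
            if not os.access(path, os.X_OK):
                problems.append("not searchable")
        report[path] = problems
    return report


if __name__ == "__main__":
    # Assumed example list: the system temp dir plus the local mapred
    # directory that appears in the log. Adjust to your own setup.
    candidates = [tempfile.gettempdir(), "/home/lukas/nutch/mapred/local"]
    for path, problems in check_access(candidates).items():
        print(path, "OK" if not problems else ", ".join(problems))
```

Any path reported with a problem is a candidate for a chown or chmod before retrying the crawl.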

Stefan

On 21.12.2005 at 06:56, Lukas Vlcek wrote:

Hi,

I am trying to use nutch-0.8-dev and I have a problem with the crawl
run. I did a checkout from SVN and prepared a fresh package (ant
package - all went fine). Then I installed Nutch on Linux and made
only minor changes to the nutch-site.xml file (turned on some plugins
and increased several constants), prepared a file with URLs and
started bin/nutch crawl.

This worked for nutch-0.7x, but with nutch-0.8-dev I am getting the
following exception in the log file:

051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)

It seems that the problem is that Nutch is not able to find the
mapred.input.subdir setting in any of the config files. I found that
there is a mapred.input.dir property defined in the config for the
particular job (job_4zwds6.xml), with a value equal to the name of my
urls file, but I don't understand where I should define the
mapred.input.subdir property and what value to assign to it (if it
needs to be defined manually at all; note that mapred.input.dir seems
to be configured automatically).
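
For anyone who wants to repeat that inspection, a small Python sketch like the following can list the properties in a job or site config file. It assumes the usual property/name/value layout of the Nutch config files; the nutch-conf root element and the sample content are assumptions made for illustration, not a copy of the real job_4zwds6.xml, and read_properties is a hypothetical helper, not a Nutch API.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Collect name/value pairs from every <property> element in the document."""
    props = {}
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Made-up sample mirroring the situation described above:
# mapred.input.dir is present, mapred.input.subdir is not.
SAMPLE = """<nutch-conf>
  <property><name>mapred.input.dir</name><value>urls</value></property>
</nutch-conf>"""

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    for key in ("mapred.input.dir", "mapred.input.subdir"):
        print(key, "=", props.get(key, "<not set>"))
```

To check a real file, read its contents and pass the text to read_properties instead of SAMPLE.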

Does anybody know the answer?

p.s: Note that the line numbers in the exception trace above for the
InputFormatBase.java file (85, 95) may differ a bit, as I inserted
some extra LOG.debug() calls there while searching for the root
cause. I removed them again, but it is possible that I left some
extra empty lines there.

Thanks,
Lukas

