Lukas,
the input folder is normally set by the tools, so you cannot change that. However, if you are on a Unix box, check that the user that runs Nutch has read and write access to all the folders defined in nutch-site.xml and nutch-default.xml. (I guess that could be the problem; Nutch uses e.g. /tmp to write some data.) If this does not solve the problem, just run the commands manually, step by step; there is a tutorial in the wiki on how to run the mapred commands step by step.
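
A rough way to do the permission check above is sketched below in Python. It is only an illustration under assumptions: it assumes the config files use <property><name>...</name><value>...</value></property> entries, and it treats every property whose name ends in ".dir" as a folder, which is a heuristic and not something Nutch defines. Values that hold several paths or variable references are not expanded. The function names are made up for this example.

```python
import os
import sys
import xml.etree.ElementTree as ET


def dir_properties(xml_text):
    """Return {name: value} for properties whose name ends in '.dir'.

    The '.dir' suffix is a heuristic chosen for this sketch.
    """
    found = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name.endswith(".dir") and value:
            found[name] = value
    return found


def check_access(paths):
    """Map each path to (is_dir, readable, writable) for the current user."""
    return {
        path: (
            os.path.isdir(path),
            os.access(path, os.R_OK),
            os.access(path, os.W_OK),
        )
        for path in paths
    }


if __name__ == "__main__":
    # Usage: python check_dirs.py conf/nutch-default.xml conf/nutch-site.xml
    for conf_file in sys.argv[1:]:
        with open(conf_file, encoding="utf-8") as fh:
            props = dir_properties(fh.read())
        for path, flags in check_access(props.values()).items():
            print(conf_file, path, "is_dir=%s read=%s write=%s" % flags)
```

Run it as the same user that runs Nutch; any folder reported with read=False or write=False is a candidate for the problem.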

Stefan

On 21.12.2005 at 06:56, Lukas Vlcek wrote:

Hi,

I am trying to use nutch-0.8-dev and I have a problem with the crawl run.
I did a checkout from SVN and prepared a fresh package (ant package - all
went fine). Then I installed Nutch on Linux and made only minor
changes to the nutch-site.xml file (turned on some plugins and increased
several constants), prepared a file with urls and started bin/nutch
crawl.

This worked for nutch-0.7x, but for nutch-0.8-dev I am receiving the
following exception in the log file:

051220 204248 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 crawl started in: ./crawl.test
051220 204249 rootUrlDir = urls
051220 204249 threads = 10
051220 204249 depth = 6
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 Injector: starting
051220 204249 Injector: crawlDb: ./crawl.test/crawldb
051220 204249 Injector: urlDir: urls
051220 204249 Injector: Converting injected urls to crawl db entries.
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/crawl-tool.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-default.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/mapred-default.xml
051220 204249 parsing /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml
051220 204249 parsing file:/home/lukas/nutch/nutch-0.8-dev/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/lukas/nutch/mapred/local/localRunner/job_4zwds6.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051220 204249 Running job: job_4zwds6
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)

It seems that the problem is that Nutch is not able to find the
mapred.input.subdir setting in any of the config files. I found that
there is a mapred.input.dir property defined in the config for this
particular job (job_4zwds6.xml) with a value equal to the name of my
urls file, but I don't understand where I should define the
mapred.input.subdir property and what value to assign to it (if it
needs to be defined manually; note that mapred.input.dir seems to be
configured automatically).
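
To see which of the parsed files actually defines a given property, a small Python sketch like the one below can help. It is illustrative only: it assumes the files use <property><name>...</name><value>...</value></property> entries and that a file loaded later overrides one loaded earlier. The function name and the file labels in the usage comment are hypothetical.

```python
import xml.etree.ElementTree as ET


def lookup(prop_name, configs):
    """Find a property across config files.

    configs is a list of (label, xml_text) pairs in load order.
    Returns (effective_value, labels_defining_it); the value is None
    when no file defines the property. "Last one wins" is an
    assumption of this sketch.
    """
    value = None
    defined_in = []
    for label, xml_text in configs:
        for prop in ET.fromstring(xml_text).iter("property"):
            if (prop.findtext("name") or "").strip() == prop_name:
                value = (prop.findtext("value") or "").strip()
                defined_in.append(label)
    return value, defined_in


# Example use (file labels are placeholders):
#   configs = [(p, open(p, encoding="utf-8").read()) for p in paths]
#   print(lookup("mapred.input.dir", configs))
#   print(lookup("mapred.input.subdir", configs))
```

An empty list of labels for a property means that none of the inspected files sets it.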

Does anybody know the answer?

P.S.: Note that the line numbers in the exception trace above for
InputFormatBase.java (85, 95) may differ a bit from the original
source. I inserted some extra LOG.debug() calls there in search of the
root cause and then removed them again, but I may have left some extra
empty lines behind.

Thanks,
Lukas




_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
