Re: [jira] Created: (NUTCH-175) No input directories specified in: while crawing in nightly build from the 14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir

Dominik Friedrich Sun, 15 Jan 2006 02:40:03 -0800

You have to put your urllist.txt into a directory and use that directory as the argument. The mapred version always expects directories as input sources.
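
A minimal sketch of that fix, assuming the commands are run from the same bin directory as in the report. The directory name "urls" is only an illustrative choice; the seed URL is the one the reporter listed. The crawl itself is guarded so the snippet does nothing harmful where the nutch script is absent.

```shell
# "urls" is a hypothetical name; any directory should work as the input source.
mkdir -p urls

# Put the seed list inside that directory (one URL per line).
printf '%s\n' 'http://www.mentor.ch' > urls/urllist.txt

# Pass the directory, not the file, as the first crawl argument.
if [ -f ./nutch ]; then
    sh ./nutch crawl urls -dir tmpdir
fi
```

With this layout the log line "rootUrlDir" should name the directory rather than urllist.txt.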


regards
Dominik



Matthias Günter (JIRA) wrote:

No input directories specified in: while crawing in nightly build from the 
14.1.2006: sh ./nutch crawl urllist.txt -dir tmpdir
------------------------------------------------------------------------------------------------------------------------------

         Key: NUTCH-175
         URL: http://issues.apache.org/jira/browse/NUTCH-175
     Project: Nutch
        Type: Bug
 Environment: SUSE Linux 9.3
    Reporter: Matthias Günter
    Priority: Trivial


[EMAIL PROTECTED]:~/workspace/lucene/nutch-nightly/bin> sh ./nutch crawl 
urllist.txt -dir tmpdir
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
060114 205612 crawl started in: tmpdir
060114 205612 rootUrlDir = urllist.txt
060114 205612 threads = 10
060114 205612 depth = 5
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
060114 205612 Injector: starting
060114 205612 Injector: crawlDb: tmpdir/crawldb
060114 205612 Injector: urlDir: urllist.txt
060114 205612 Injector: Converting injected urls to crawl db entries.
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/crawl-tool.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
060114 205612 Running job: job_n0o7ps
060114 205612 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-default.xml
060114 205613 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/mapred-default.xml
060114 205613 parsing /tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml
060114 205613 parsing 
file:/home/guenter/workspace/lucene/nutch-nightly/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: 
nutch-default.xml , mapred-default.xml , 
/tmp/nutch/mapred/local/localRunner/job_n0o7ps.xml , nutch-site.xml
        at 
org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at 
org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at 
org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
060114 205613  map 0%
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)


urllist.txt contains
  http://www.mentor.ch

PS: Is there a committer or developer (near Switzerland) who can provide paid 
support for a mixed index covering an intranet, some internet sites and scanning 
of local drives (P:\ , S:\ etc)?
