Hello,

I just found out that nutch-0.6 crawls the local filesystem without
errors!

In contrast, nutch-0.7 and nutch-0.7.1 both failed, with different errors
but at the same point. The current snapshot in Subversion didn't work for me
either, and fails with a quite different error. The error logs are below.

In the meantime I found the jar file that contains the IndexingFilter class:
it is located in the nutch root directory. In my tests, however, it was not
necessary to specify this jar file in the CLASSPATH; nutch-0.6 worked fine
without it.

So it would be interesting to hear from a developer on this issue.

Regards,
Alfred


----------------------------------------------------------------------------

nutch-0.7:

run java in c:\j2sdk1.4.2_04\jre
051219 041455 parsing file:/C:/nutch-0.7/conf/nutch-default.xml
051219 041455 parsing file:/C:/nutch-0.7/conf/crawl-tool.xml
051219 041455 parsing file:/C:/nutch-0.7/conf/nutch-site.xml
051219 041455 No FS indicated, using default:local
051219 041455 crawl started in: crawl.test
051219 041455 rootUrlFile = urls
051219 041455 threads = 10
051219 041455 depth = 3
051219 041455 Created webdb at LocalFS,C:\nutch-0.7\crawl.test\db
051219 041456 Starting URL processing
051219 041456 Plugins: looking in: C:\nutch-0.7\plugins
051219 041456 not including: C:\nutch-0.7\plugins\clustering-carrot2
051219 041456 not including: C:\nutch-0.7\plugins\creativecommons
051219 041456 parsing: C:\nutch-0.7\plugins\index-basic\plugin.xml
051219 041456 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051219 041456 not including: C:\nutch-0.7\plugins\index-more
051219 041456 not including: C:\nutch-0.7\plugins\language-identifier
051219 041456 not including: C:\nutch-0.7\plugins\ontology
051219 041456 not including: C:\nutch-0.7\plugins\parse-ext
051219 041456 parsing: C:\nutch-0.7\plugins\parse-html\plugin.xml
051219 041456 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051219 041456 not including: C:\nutch-0.7\plugins\parse-js
051219 041456 not including: C:\nutch-0.7\plugins\parse-msword
051219 041456 not including: C:\nutch-0.7\plugins\parse-pdf
051219 041456 not including: C:\nutch-0.7\plugins\parse-rss
051219 041456 parsing: C:\nutch-0.7\plugins\parse-text\plugin.xml
051219 041456 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051219 041456 parsing: C:\nutch-0.7\plugins\protocol-file\plugin.xml
051219 041456 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051219 041456 not including: C:\nutch-0.7\plugins\protocol-ftp
051219 041456 parsing: C:\nutch-0.7\plugins\protocol-http\plugin.xml
051219 041456 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051219 041456 not including: C:\nutch-0.7\plugins\protocol-httpclient
051219 041456 parsing: C:\nutch-0.7\plugins\query-basic\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051219 041456 not including: C:\nutch-0.7\plugins\query-more
051219 041456 parsing: C:\nutch-0.7\plugins\query-site\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051219 041456 parsing: C:\nutch-0.7\plugins\query-url\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051219 041456 not including: C:\nutch-0.7\plugins\urlfilter-prefix
051219 041456 not including: C:\nutch-0.7\plugins\urlfilter-regex
java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
        ... 4 more
Exception in thread "main" 


----------------------------------------------------------------------------

nutch Rev. 357497:

051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020610 crawl started in: crawl.test
051219 020610 rootUrlDir = urls
051219 020610 threads = 10
051219 020610 depth = 3
051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020610 Injector: starting
051219 020610 Injector: crawlDb: crawl.test\crawldb
051219 020610 Injector: urlDir: urls
051219 020610 Injector: Converting injected urls to crawl db entries.
051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020611 Running job: job_8cjn0j
051219 020611 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020611 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020611 parsing \tmp\nutch\mapred\local\localRunner\job_8cjn0j.xml
051219 020611 parsing file:/C:/trunk/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , \tmp\nutch\mapred\local\localRunner\job_8cjn0j.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051219 020612  map 0%
java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)
Exception in thread "main" 
