Hello,

I just found out that nutch-0.6 crawls the local filesystem without
errors!!!

In contrast, nutch-0.7 and nutch-0.7.1 both failed, with different errors
but at the same point in the crawl. The current Subversion snapshot also
failed for me, with a quite different error. The error logs are below.

In the meantime I found the jar file that contains the IndexingFilter class:
it is in the Nutch root directory. In my tests, however, it was not necessary
to put this jar file on the CLASSPATH; nutch-0.6 worked fine without it.
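
For anyone who wants to repeat the CLASSPATH check, here is a minimal sketch of how one might test whether a class is visible on the CLASSPATH. The ClassPathCheck class and its isLoadable method are hypothetical helpers written for this mail, not part of Nutch, and the check only covers the JVM's own CLASSPATH, not anything Nutch may load by other means.

```java
// Hypothetical helper (not part of Nutch): reports whether a class can be
// loaded from the CLASSPATH that this JVM was started with.
public class ClassPathCheck {

    /** Returns true if the named class is visible to this class's loader. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its static initializers.
            Class.forName(className, false, ClassPathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found but one of its dependencies could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // The default class name is taken from the plugin log further down.
        String name = args.length > 0
                ? args[0]
                : "org.apache.nutch.indexer.IndexingFilter";
        System.out.println(name + (isLoadable(name) ? " found" : " NOT found")
                + " on CLASSPATH");
    }
}
```

Running it once with the jar file on the CLASSPATH and once without should show whether the jar is the only place the class comes from.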

So it would be interesting to hear from a developer about this issue.

Regards,
Alfred


----------------------------------------------------------------------------

nutch-0.7:

run java in c:\j2sdk1.4.2_04\jre
051219 041455 parsing file:/C:/nutch-0.7/conf/nutch-default.xml
051219 041455 parsing file:/C:/nutch-0.7/conf/crawl-tool.xml
051219 041455 parsing file:/C:/nutch-0.7/conf/nutch-site.xml
051219 041455 No FS indicated, using default:local
051219 041455 crawl started in: crawl.test
051219 041455 rootUrlFile = urls
051219 041455 threads = 10
051219 041455 depth = 3
051219 041455 Created webdb at LocalFS,C:\nutch-0.7\crawl.test\db
051219 041456 Starting URL processing
051219 041456 Plugins: looking in: C:\nutch-0.7\plugins
051219 041456 not including: C:\nutch-0.7\plugins\clustering-carrot2
051219 041456 not including: C:\nutch-0.7\plugins\creativecommons
051219 041456 parsing: C:\nutch-0.7\plugins\index-basic\plugin.xml
051219 041456 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051219 041456 not including: C:\nutch-0.7\plugins\index-more
051219 041456 not including: C:\nutch-0.7\plugins\language-identifier
051219 041456 not including: C:\nutch-0.7\plugins\ontology
051219 041456 not including: C:\nutch-0.7\plugins\parse-ext
051219 041456 parsing: C:\nutch-0.7\plugins\parse-html\plugin.xml
051219 041456 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051219 041456 not including: C:\nutch-0.7\plugins\parse-js
051219 041456 not including: C:\nutch-0.7\plugins\parse-msword
051219 041456 not including: C:\nutch-0.7\plugins\parse-pdf
051219 041456 not including: C:\nutch-0.7\plugins\parse-rss
051219 041456 parsing: C:\nutch-0.7\plugins\parse-text\plugin.xml
051219 041456 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051219 041456 parsing: C:\nutch-0.7\plugins\protocol-file\plugin.xml
051219 041456 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051219 041456 not including: C:\nutch-0.7\plugins\protocol-ftp
051219 041456 parsing: C:\nutch-0.7\plugins\protocol-http\plugin.xml
051219 041456 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051219 041456 not including: C:\nutch-0.7\plugins\protocol-httpclient
051219 041456 parsing: C:\nutch-0.7\plugins\query-basic\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051219 041456 not including: C:\nutch-0.7\plugins\query-more
051219 041456 parsing: C:\nutch-0.7\plugins\query-site\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051219 041456 parsing: C:\nutch-0.7\plugins\query-url\plugin.xml
051219 041456 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051219 041456 not including: C:\nutch-0.7\plugins\urlfilter-prefix
051219 041456 not including: C:\nutch-0.7\plugins\urlfilter-regex
java.lang.ExceptionInInitializerError
        at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
        at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
        at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
        at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
        ... 4 more
Exception in thread "main" 


----------------------------------------------------------------------------

nutch Rev. 357497:

051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020610 crawl started in: crawl.test
051219 020610 rootUrlDir = urls
051219 020610 threads = 10
051219 020610 depth = 3
051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020610 Injector: starting
051219 020610 Injector: crawlDb: crawl.test\crawldb
051219 020610 Injector: urlDir: urls
051219 020610 Injector: Converting injected urls to crawl db entries.
051219 020610 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020610 parsing file:/C:/trunk/conf/crawl-tool.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020610 parsing file:/C:/trunk/conf/nutch-site.xml
051219 020611 Running job: job_8cjn0j
051219 020611 parsing file:/C:/trunk/conf/nutch-default.xml
051219 020611 parsing file:/C:/trunk/conf/mapred-default.xml
051219 020611 parsing \tmp\nutch\mapred\local\localRunner\job_8cjn0j.xml
051219 020611 parsing file:/C:/trunk/conf/nutch-site.xml
java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , \tmp\nutch\mapred\local\localRunner\job_8cjn0j.xml , nutch-site.xml
        at org.apache.nutch.mapred.InputFormatBase.listFiles(InputFormatBase.java:85)
        at org.apache.nutch.mapred.InputFormatBase.getSplits(InputFormatBase.java:95)
        at org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:63)
051219 020612  map 0%
java.io.IOException: Job failed!
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)
Exception in thread "main" 



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general