Here's the problem:

I need to get the Nutch engine running on a collection of XML documents (news stories) that I have. The files are named as follows:

example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366

Each XML file contains no HTML, just XML elements (tags) and text. All of these files (500 to start with) are listed in my 'urls' file. I have followed the steps in the FAQ (http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6), but without success. I'm wondering if I'm missing something.
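
For anyone reproducing this setup, a 'urls' file of this kind can be generated rather than typed by hand. The following is only a minimal sketch: the function names and the `example.xml.*` glob pattern are assumptions chosen to match the file names above, and it emits the same `file:///...` form that appears in the fetch log further down.

```python
from pathlib import Path


def build_url_list(doc_dir, pattern="example.xml.*"):
    """Return sorted file:// URLs for every regular file in doc_dir matching pattern.

    Both the function name and the default pattern are illustrative assumptions.
    """
    files = sorted(p for p in Path(doc_dir).glob(pattern) if p.is_file())
    # as_uri() requires an absolute path and yields the file:///... form.
    return [p.resolve().as_uri() for p in files]


def write_url_file(doc_dir, url_file="urls", pattern="example.xml.*"):
    """Write one URL per line to url_file and return how many were written."""
    urls = build_url_list(doc_dir, pattern)
    Path(url_file).write_text("".join(u + "\n" for u in urls))
    return len(urls)
```

Calling `write_url_file("/root/Downloads/topix")` would write one URL per line into a file named 'urls' in the current directory; the directory is the one that shows up in the fetch log, so adjust it to wherever the documents actually live.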

When I run the crawl after making the three modifications described there, I get the following error:

[EMAIL PROTECTED] nutch-0.7]# bin/nutch crawl urls -dir crawl.test -depth 3
051107 234038 parsing file:/root/Downloads/nutch-0.7/conf/nutch-default.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/crawl-tool.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/nutch-site.xml
051107 234039 No FS indicated, using default:local
051107 234039 crawl started in: crawl.test
051107 234039 rootUrlFile = urls
051107 234039 threads = 10
051107 234039 depth = 3
051107 234039 Created webdb at LocalFS,/root/Downloads/nutch-0.7/crawl.test/db
051107 234039 Starting URL processing
051107 234039 Plugins: looking in: /root/Downloads/nutch-0.7/plugins
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/clustering-carrot2
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/creativecommons
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/index-basic/plugin.xml
051107 234039 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/index-more
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/language-identifier
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/ontology
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/parse-ext
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/parse-html/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-js
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-msword
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-pdf
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-rss
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/parse-text/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-file/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-ftp
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-http/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-httpclient
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-basic/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/query-more
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-site/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-url/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-prefix
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-regex
Exception in thread "main" java.lang.ExceptionInInitializerError
       at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
       at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
       at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
       at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
       at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
       ... 4 more
[EMAIL PROTECTED] nutch-0.7]#
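
The plugin lines in a log like this are easier to read once they are reduced to a loaded/skipped table. Below is a rough sketch of such a summary, assuming only the two line shapes visible above ("not including: <dir>" and "parsing: <dir>/plugin.xml"); the helper names are made up for illustration. Whether the two skipped urlfilter-* plugins are actually what triggers the "URLFilter not found" exception is an inference from this log, not something it proves.

```python
import re

# Matches the two plugin line shapes seen in the log:
#   "not including: <path>/plugins/<name>"
#   "parsing: <path>/plugins/<name>/plugin.xml"
PLUGIN_RE = re.compile(r"(not including|parsing): \S*/plugins/([\w-]+)")


def plugin_status(log_text):
    """Map plugin name -> True if it was loaded ('parsing:'), False if skipped."""
    status = {}
    for kind, name in PLUGIN_RE.findall(log_text):
        status[name] = (kind == "parsing")
    return status


def missing_by_prefix(status, prefix):
    """Return True when no loaded plugin name starts with prefix, e.g. 'urlfilter-'."""
    return not any(loaded and name.startswith(prefix)
                   for name, loaded in status.items())
```

Feeding the log text above to `plugin_status` and then calling `missing_by_prefix(status, "urlfilter-")` reports whether any URL filter plugin was loaded in that run; the same check with `"protocol-"` covers the protocol plugins.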

Now, when I remove the property recommended in the last step of the process outlined above, I get the following recurring errors, but the crawl finishes (unlike the run above, which aborted prematurely):

051107 214422 fetching file:///root/Downloads/topix/example.xml.53324
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53324 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53077
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53077 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53376
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53376 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

And so on, for all 500 files. The crawl does finish, but as you can see, nothing was ever indexed or processed (call it what you will).

I have looked through the documentation a thousand times, and this is still holding me up. If anyone here has had a similar problem or has a solution, please enlighten me. Thanks a ton, guys :)

Tyler
