Request for info regarding filesystem based index.

Mike Reynols Wed, 09 Nov 2005 10:21:18 -0800

Here's the problem:

I need to get the Nutch engine running on a collection of xml documents that I have (containing news stories). The files are named in the following manner:


example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366

Each xml file contains no html, just xml nodes (tags) and text. I have these files (500 to start off with) all listed in my 'urls' file. I have followed these steps (http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6), but without success. I'm wondering if I'm missing something.
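In case it helps anyone reproduce this, here is roughly how a 'urls' file like mine could be generated. It is only a sketch: the function names build_url_list and write_url_file are made up for illustration, and the directory is whatever holds the example.xml.NNNNN files.

```python
from pathlib import Path


def build_url_list(doc_dir, pattern="example.xml.*"):
    """Return file:// URLs for every matching file, ordered by numeric suffix."""
    files = [p for p in Path(doc_dir).glob(pattern) if p.is_file()]

    def numeric_suffix(path):
        # "example.xml.52908" has the suffix ".52908"; non-numeric suffixes sort first.
        tail = path.suffix.lstrip(".")
        return int(tail) if tail.isdigit() else -1

    files.sort(key=numeric_suffix)
    # as_uri() needs an absolute path, hence resolve().
    return [p.resolve().as_uri() for p in files]


def write_url_file(doc_dir, out_path="urls"):
    """Write one URL per line to out_path (hypothetical helper)."""
    urls = build_url_list(doc_dir)
    Path(out_path).write_text("\n".join(urls) + "\n")
    return len(urls)
```

Calling write_url_file on the document directory would produce one file:/// URL per line, in the same form as the URLs that show up in the fetch log further down.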

When I run the crawl after these three modifications, I get the following error:


[EMAIL PROTECTED] nutch-0.7]# bin/nutch crawl urls -dir crawl.test -depth 3
051107 234038 parsing file:/root/Downloads/nutch-0.7/conf/nutch-default.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/crawl-tool.xml
051107 234039 parsing file:/root/Downloads/nutch-0.7/conf/nutch-site.xml
051107 234039 No FS indicated, using default:local
051107 234039 crawl started in: crawl.test
051107 234039 rootUrlFile = urls
051107 234039 threads = 10
051107 234039 depth = 3

051107 234039 Created webdb at LocalFS,/root/Downloads/nutch-0.7/crawl.test/db
051107 234039 Starting URL processing
051107 234039 Plugins: looking in: /root/Downloads/nutch-0.7/plugins
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/clustering-carrot2
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/creativecommons
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/index-basic/plugin.xml
051107 234039 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/index-more
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/language-identifier
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/ontology
051107 234039 not including: /root/Downloads/nutch-0.7/plugins/parse-ext
051107 234039 parsing: /root/Downloads/nutch-0.7/plugins/parse-html/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-js
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-msword
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-pdf
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/parse-rss
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/parse-text/plugin.xml
051107 234040 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-file/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.file.File
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-ftp
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/protocol-http/plugin.xml
051107 234040 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/protocol-httpclient
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-basic/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/query-more
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-site/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051107 234040 parsing: /root/Downloads/nutch-0.7/plugins/query-url/plugin.xml
051107 234040 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-prefix
051107 234040 not including: /root/Downloads/nutch-0.7/plugins/urlfilter-regex

Exception in thread "main" java.lang.ExceptionInInitializerError
       at org.apache.nutch.db.WebDBInjector.addPage(WebDBInjector.java:437)
       at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:378)
       at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
       at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)
Caused by: java.lang.RuntimeException: org.apache.nutch.net.URLFilter not found.
       at org.apache.nutch.net.URLFilters.<clinit>(URLFilters.java:44)
       ... 4 more
[EMAIL PROTECTED] nutch-0.7]#
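
Since the plugin part of that log is hard to read by eye, here is a small sketch for listing which plugins were loaded ("parsing:") and which were skipped ("not including:"). It assumes one log entry per line in the format shown above; summarize_plugins is just a name I picked, not anything from Nutch.

```python
import re

# Matches entries such as:
#   051107 234040 not including: /root/.../plugins/parse-pdf
#   051107 234040 parsing: /root/.../plugins/parse-html/plugin.xml
# The "parsing file:..." lines for the conf files do not match (no colon after "parsing").
PLUGIN_LINE = re.compile(
    r"(?P<action>not including|parsing):\s*\S*/plugins/(?P<name>[^/\s]+)"
)


def summarize_plugins(log_text):
    """Return (loaded, skipped) plugin names in the order they appear."""
    loaded, skipped = [], []
    for match in PLUGIN_LINE.finditer(log_text):
        target = skipped if match.group("action") == "not including" else loaded
        name = match.group("name")
        if name not in target:
            target.append(name)
    return loaded, skipped
```

Going by the log above, both urlfilter-prefix and urlfilter-regex end up in the skipped list, which might be related to the URLFilter error, though I am not sure.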

Now when I remove the property that was recommended in the last step of the above outlined process, I get the following recurring errors, but the crawl finishes (unlike the above run, which caused the crawl to abort prematurely):


051107 214422 fetching file:///root/Downloads/topix/example.xml.53324
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53324 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53077
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53077 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
051107 214422 fetching file:///root/Downloads/topix/example.xml.53376
051107 214422 fetch of file:///root/Downloads/topix/example.xml.53376 failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file

And so on, for all 500 files. The crawl actually finishes, but as you can see, nothing was ever indexed or processed (call it what you will).

I have looked through the documentation a thousand times, and this is holding me up. If anyone here has had a similar problem or has a solution, please enlighten me. Thanks a ton guys :)


Tyler
