Hi all,

I've recently downloaded Nutch v0.9 to experiment with searching blog posts and RSS/Atom feeds. So far I have managed to get it to crawl, index and search some websites successfully.

I am now starting to investigate using Nutch to crawl, index and search news/blog feeds. I have enabled the parse-rss plugin, which appears to ship in the plugins/ directory, by pasting the following into my nutch-site.xml file:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

However, some feeds return the following error, apparently because they are served with a MIME type of application/xml:

Error parsing: http://example.com/.rss: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/xml url=http://example.com/.rss

It also appears that search results point to the matching feed rather than to the matching item. Is there a way around this? Or am I best off parsing out the item URLs (e.g. via a shell script), adding them to the crawl list somehow, and indexing the HTML as normal?
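
To make that second option concrete, here is a rough sketch of the kind of pre-processing step I have in mind, written in Python rather than shell. It is only an illustration: the function name extract_item_urls and the sample feed are made up for this example, it handles only plain RSS 2.0 <item> elements and Atom <entry> elements, and it does not touch Nutch itself. The resulting URLs would still have to be added to the seed list by hand.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 feeds.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_item_urls(feed_xml):
    """Return the per-item links from an RSS 2.0 or Atom feed (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    urls = []

    # RSS 2.0: each <item> carries its URL as the text of a <link> child.
    # Looking only inside <item> skips the channel-level <link>.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())

    # Atom: each <entry> has <link href="..."/>; a missing rel means "alternate".
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                urls.append(href.strip())

    return urls


if __name__ == "__main__":
    # Made-up sample feed, for illustration only.
    sample = """<rss version="2.0"><channel>
      <link>http://example.com/</link>
      <item><title>First post</title><link>http://example.com/first.html</link></item>
    </channel></rss>"""
    for url in extract_item_urls(sample):
        print(url)
```

Each printed URL could then be appended, one per line, to whatever seed file the crawl is started from, so that the item pages are fetched and indexed as ordinary HTML.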

Also, if anyone is using Nutch to index blogs/feeds, then I'd be interested in how you have it configured.

Thanks again,

--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://www.calicojack.co.uk/
