Indexing Feeds & Blog Posts with Nutch

Rick Moynihan Thu, 11 Oct 2007 09:19:27 -0700

Hi all,

I've recently downloaded Nutch v0.9 to experiment with searching blog posts and RSS/Atom feeds. So far I have managed to get it to successfully crawl, index and search some websites.

I am now starting to investigate using Nutch to crawl/index/search news/blog feeds, and have included the parse-rss plugin, which appears to ship in the plugins/ directory, by pasting the following into my nutch-site.xml file:


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

However, some feeds fail with the following error, apparently because they are being served with a MIME type of application/xml:

Error parsing: http://example.com/.rss: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/xml url=http://example.com/.rss

It also appears when searching that the returned results point to the matching feed rather than the matching item. Is there a way around this? Or am I best off parsing out the item URLs (e.g. via a shell script), somehow adding them to the crawl list, and indexing the HTML as normal?

Also, if anyone is using Nutch to index blogs/feeds, then I'd be interested in how you have it configured.
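
To make the second option concrete, here is a rough sketch of the "parse out the item URLs" step, written in Python rather than shell. It is only an illustration under some assumptions: the function name extract_item_urls is made up for this example, it handles only RSS 2.0 and Atom documents (not RSS 1.0/RDF), and it uses nothing from Nutch itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 feeds, in ElementTree's {uri}tag notation.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_item_urls(feed_xml):
    """Return the per-item links found in an RSS 2.0 or Atom feed.

    Hypothetical helper for illustration; feed_xml is the feed
    document as a string (or bytes).
    """
    root = ET.fromstring(feed_xml)
    urls = []

    # RSS 2.0 layout: <rss><channel><item><link>URL</link></item>...
    for link in root.iterfind("./channel/item/link"):
        if link.text and link.text.strip():
            urls.append(link.text.strip())

    # Atom layout: <feed><entry><link rel="alternate" href="URL"/>...
    # A missing rel attribute is treated as "alternate".
    for link in root.iterfind(f"./{ATOM_NS}entry/{ATOM_NS}link"):
        if link.get("rel", "alternate") == "alternate" and link.get("href"):
            urls.append(link.get("href"))

    return urls
```

The returned URLs could then be written one per line into a plain-text file in the directory used to seed the crawl, so that each post is fetched and indexed as an ordinary HTML page.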


Thanks again,

--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://www.calicojack.co.uk/
