Hi all,
I've recently downloaded Nutch v0.9 to experiment with searching blog
posts and RSS/Atom feeds. So far I have managed to get it to
successfully crawl, index and search some websites.
I am now starting to investigate using Nutch to crawl/index/search
news/blog feeds, and have included the parse-rss plugin (which appears
to ship in the plugins/ directory) by pasting the following into my
nutch-site.xml file:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
However, some feeds appear to return the following error, apparently
because they are being served with a MIME type of application/xml:
Error parsing: http://example.com/.rss: failed(2,200):
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/xml url=http://example.com/.rss
It also appears when searching that the returned results point to the
matching feed rather than the matching item. Is there a way around
this? Or am I better off parsing out the item URLs (e.g. via a shell
script), adding them to the crawl list somehow, and indexing the HTML
as normal?
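To make that second option concrete, here is a rough sketch of the kind
of thing I have in mind, written in Python rather than shell. It is only
an illustration: the function names and the seed-file layout (one URL
per line) are my own assumptions and not anything taken from Nutch, and
it only handles plain RSS 2.0 <item> links and Atom <entry> links (RSS
1.0/RDF feeds are not covered).

```python
import sys
import xml.etree.ElementTree as ET

# Namespace used by Atom 1.0 feeds.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_item_urls(feed_xml):
    """Return the per-item URLs found in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(feed_xml)
    urls = []

    # RSS 2.0: each <item> carries its URL as the text of a <link> child.
    # The channel-level <link> is skipped because only <item> is visited.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())

    # Atom: each <entry> has <link href="..."> elements; take the first
    # one whose rel is "alternate" (the default when rel is absent).
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                urls.append(link.get("href"))
                break

    return urls


def write_seed_file(urls, path):
    """Write URLs one per line (assumed format for a crawl seed list)."""
    with open(path, "w") as out:
        for url in urls:
            out.write(url + "\n")


if __name__ == "__main__":
    # Hypothetical usage: python feed_urls.py feed1.xml feed2.xml > seeds.txt
    for feed_path in sys.argv[1:]:
        with open(feed_path, "rb") as f:
            for url in extract_item_urls(f.read()):
                print(url)
```

The feeds would need to be fetched separately (e.g. with wget) before
running this, and the resulting list would then be crawled as ordinary
HTML pages.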
Also, if anyone is using Nutch to index blogs/feeds, then I'd be
interested in how you have it configured.
Thanks again,
--
Rick Moynihan
Software Engineer
Calico Jack LTD
http://www.calicojack.co.uk/