Hi,
I'm trying to set up Nutch to crawl blogs.
In nutch-site.xml, I added parse-rss to plugin.includes:
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>
and set db.ignore.internal.links to false.
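For reference, that second override looks roughly like this in my nutch-site.xml (a sketch of the usual property layout; the plugin.includes value above sits in a property element of the same shape):

```xml
<!-- nutch-site.xml override: do not ignore links within the same host -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```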
I noticed that in parse-plugins.xml:
<mimeType name="text/xml">
<plugin id="parse-text" />
<plugin id="parse-html" />
<plugin id="parse-rss" />
</mimeType>
Is this list in order of priority, which would make parse-rss the last parser tried?
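If the order does determine priority, I assume putting parse-rss first would look something like this (untested; same plugin ids as above, only reordered):

```xml
<!-- parse-plugins.xml: parse-rss listed first for text/xml (assumed to mean it is tried first) -->
<mimeType name="text/xml">
  <plugin id="parse-rss" />
  <plugin id="parse-html" />
  <plugin id="parse-text" />
</mimeType>
```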
I tried injecting a single URL, my blog feed, which is served as text/xml:
http://migs.paraz.com/w/feed/
It apparently isn't parsed.
Thanks in advance.