Hi,
I'm trying to set up Nutch to crawl blogs.

In nutch-site.xml, I added parse-rss to plugin.includes:
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>


I also set db.ignore.internal.links to false.
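
In case it helps, here is roughly how those two overrides sit in my nutch-site.xml. This is only a sketch in the usual property layout: the surrounding <configuration> element and the comments are written out for illustration, and the values are the ones mentioned above.

```xml
<configuration>
  <!-- Enable parse-rss alongside the text, html and js parsers -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more|query-(basic|site|url)</value>
  </property>
  <!-- Do not discard links that point back to the same host -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>
</configuration>
```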

I noticed this entry in parse-plugins.xml:

        <mimeType name="text/xml">
                <plugin id="parse-text" />
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
        </mimeType>

Is this list in order of priority, so that parse-rss is tried last?

I tried injecting a single URL, my blog feed, which is served as text/xml:
http://migs.paraz.com/w/feed/

It apparently isn't parsed.

Thanks in advance.
