I've been using Nutch to crawl a lot of news feeds, and I had to modify my plugins file to handle a bunch of MIME types. Not many sites follow the spec on which MIME type to use.

My parse-plugins.xml file has mime-type mappings for all of these:

text/html
text/plain
text/rss
text/xml
application/xml
application/rss+xml
application/atom+xml
application/xhtml+xml
application/octet-stream
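
For anyone following along, here is a minimal sketch of what such mappings might look like in parse-plugins.xml. It covers only two of the types listed above. Routing text/xml to parse-rss and the alias entry are assumptions for illustration, not a copy of the file described here, so check them against the parse-plugins.xml shipped with your Nutch version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: which plugin handles each type is an assumption. -->
<parse-plugins>

  <!-- Many feeds are served as text/xml instead of application/rss+xml -->
  <mimeType name="text/xml">
    <plugin id="parse-rss" />
  </mimeType>

  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
  </mimeType>

  <!-- Ties the plugin id used above to the parser's extension id -->
  <aliases>
    <alias name="parse-rss"
           extension-id="org.apache.nutch.parse.rss.RSSParser" />
  </aliases>

</parse-plugins>
```

As the warning quoted further down this thread shows, a mapping here is only half of it: the plugin's own plugin.xml also has to claim the content type, or ParserFactory logs a warning.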

-dave

On Sep 27, 2008, at 6:44 AM, Chetan Patel wrote:


Hi,

I got the following message in the log file while crawling an XML URL.

2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
parse-plugins.xml, but its plugin.xml file does not claim to support
contentType: text/xml

Please help me if you have any idea.

-Chetan



Chetan Patel wrote:

Hi,

Thanks for the help.

I have already added this to plugin.includes, and I am still getting only the root URL.

Regards,
Chetan Patel


Edward Quick wrote:


Chetan,

Try adding parse-rss in nutch-site.xml. Here's mine:

<property>
 <name>plugin.includes</name>
 <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description></description>
</property>


Ed.


Date: Sat, 27 Sep 2008 01:30:43 -0700
From: [EMAIL PROTECTED]
To: [email protected]
Subject: crawl xml url using nutch-0.9


Hi All,

I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml) using depth 2.

But it crawls only the root URL.

Please help me crawl the root URL as well as all sub-URLs of the root URL.

Thanks in advance.

Regards,
Chetan Patel
--
View this message in context:
http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
Sent from the Nutch - User mailing list archive at Nabble.com.







