I've been using Nutch to crawl a lot of news feeds, and I had to modify
my parse-plugins.xml file to handle a bunch of MIME types, because not
many sites follow the spec on which MIME type to use.
My parse-plugins.xml file has mime-type mappings for all of these:
text/html
text/plain
text/rss
text/xml
application/xml
application/rss+xml
application/atom+xml
application/xhtml+xml
application/octet-stream
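
For illustration, a minimal sketch of what two of these mappings might look
like in parse-plugins.xml. This is not the actual file: the choice of
parse-rss for each type is an assumption, and only the extension id
org.apache.nutch.parse.rss.RSSParser is taken from the log message quoted
later in this thread.

```xml
<!-- Sketch only: which plugin handles each type is an assumption -->
<parse-plugins>
  <!-- Feeds served with a generic XML type, routed to the RSS parser -->
  <mimeType name="text/xml">
    <plugin id="parse-rss" />
  </mimeType>
  <!-- Feeds served with the RSS-specific type -->
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
  </mimeType>
  <!-- Maps the plugin id used above to its parser extension id -->
  <aliases>
    <alias name="parse-rss"
           extension-id="org.apache.nutch.parse.rss.RSSParser" />
  </aliases>
</parse-plugins>
```

The warning quoted later in the thread shows that Nutch also checks the
plugin's own plugin.xml for the content type, so a mapping like this may
still produce that warning.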
-dave
On Sep 27, 2008, at 6:44 AM, Chetan Patel wrote:
Hi,
I got the following message in the log file while crawling an XML URL:
2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
Please help me if you have any idea.
-Chetan
Chetan Patel wrote:
Hi,
Thanks for the help.
I had already added this to plugin.includes,
and I am still getting only the root URL.
Regards,
Chetan Patel
Edward Quick wrote:
Chetan,
Try adding parse-rss in nutch-site.xml. Here's mine:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description></description>
</property>
Ed.
Date: Sat, 27 Sep 2008 01:30:43 -0700
From: [EMAIL PROTECTED]
To: [email protected]
Subject: crawl xml url using nutch-0.9
Hi All,
I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml)
using depth 2, but it crawls only the root URL.
Please help me crawl the root URL as well as all the sub-URLs of the
root URL.
Thanks in advance.
Regards,
Chetan Patel
--
View this message in context:
http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
Sent from the Nutch - User mailing list archive at Nabble.com.