Hi, I got the following message in the log file while crawling an XML URL:
2008-09-27 16:06:20,920 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml

Please help me if you have any idea.

-Chetan

Chetan Patel wrote:
>
> Hi,
>
> Thanks for the help.
>
> I have already added this to plugin.includes,
> and I am still getting only the root URL.
>
> Regards,
> Chetan Patel
>
>
> Edward Quick wrote:
>>
>> Chetan,
>>
>> Try adding parse-rss in nutch-site.xml. Here's mine:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf|zip|swf|rss)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>   <description></description>
>> </property>
>>
>> Ed.
>>
>>> Date: Sat, 27 Sep 2008 01:30:43 -0700
>>> From: [EMAIL PROTECTED]
>>> To: [email protected]
>>> Subject: crawl xml url using nutch-0.9
>>>
>>> Hi All,
>>>
>>> I have tried to crawl an XML URL (http://sports.yahoo.com/nfl/rss.xml)
>>> using depth 2.
>>>
>>> But it crawls only the root URL.
>>>
>>> Please help me crawl the root URL as well as all the sub-URLs under it.
>>>
>>> Thanks in advance.
>>>
>>> Regards,
>>> Chetan Patel
>>> --
>>> View this message in context:
>>> http://www.nabble.com/crawl-xml-url-using-nutch-0.9-tp19700770p19700770.html
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
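
The warning in the log describes a mismatch between two config files: parse-plugins.xml maps text/xml to the RSS parser, but the parser's own plugin.xml does not list text/xml as a supported content type. A minimal sketch of that check is below. The two XML strings are simplified, assumed shapes and are not copied from a Nutch 0.9 install; the function names are hypothetical, and the exact string comparison is also an assumption about how the declared value is matched. Compare against the real files under conf/ and plugins/parse-rss/ before relying on it.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: simplified stand-ins for conf/parse-plugins.xml and
# plugins/parse-rss/plugin.xml; the real files may differ in layout.
PARSE_PLUGINS_XML = """
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-rss"/>
  </mimeType>
</parse-plugins>
"""

PLUGIN_XML = """
<plugin id="parse-rss">
  <extension point="org.apache.nutch.parse.Parser">
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser">
      <parameter name="contentType" value="application/rss+xml"/>
    </implementation>
  </extension>
</plugin>
"""


def mapped_types(parse_plugins_xml, plugin_id):
    """Content types that the mapping file routes to the given plugin."""
    root = ET.fromstring(parse_plugins_xml)
    return {
        mime.get("name")
        for mime in root.iter("mimeType")
        if any(p.get("id") == plugin_id for p in mime.iter("plugin"))
    }


def claimed_types(plugin_xml):
    """Content types the plugin declares for itself (exact values only)."""
    root = ET.fromstring(plugin_xml)
    return {
        param.get("value")
        for param in root.iter("parameter")
        if param.get("name") == "contentType"
    }


def unclaimed_types(parse_plugins_xml, plugin_xml, plugin_id):
    """Types mapped to the plugin that its plugin.xml does not declare."""
    mapped = mapped_types(parse_plugins_xml, plugin_id)
    return sorted(mapped - claimed_types(plugin_xml))


if __name__ == "__main__":
    for content_type in unclaimed_types(PARSE_PLUGINS_XML, PLUGIN_XML, "parse-rss"):
        print("mapped but not claimed:", content_type)
```

With the sample strings above, the only type reported is text/xml, which is the same pairing the warning names. If the real plugin.xml shows the same gap, the declared contentType there is the place to look.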
