Hi Rick,

Glad to hear that you're interested in using Nutch!

There are currently 2 plugins that parse feeds and get them indexed:

  parse-rss - older, but gets the job done
  feed - newer, and takes advantage of the ability to parse/index feeds in
         one step, rather than in many

There are other idiosyncrasies about each of these plugins, so feel free to
ask the main developers of each specific questions. The parse-rss plugin was
primarily developed by me, and the feed plugin was primarily developed by
Doğacan Güney, another Nutch committer like myself.

As for the error that you're getting below, it's due to the fact that Nutch
can't reliably differentiate between the mime types of different kinds of XML
content. So, to Nutch, even though it's a .rss file, its mime type is
application/xml. That is a valid mime type for the file, but it is not the
preferred one (application/rss+xml, or the like), so Nutch has trouble finding
the appropriate parser for the content. For instance, according to
parse-plugins.xml (a file in your $NUTCH_HOME/conf directory), the parse-rss
plugin and the feed plugin are registered to parse application/rss+xml, but
not application/xml.

The current trunk version of Nutch recently had a fix committed for this very
issue (http://issues.apache.org/jira/browse/NUTCH-562).

If you have any more specific questions, I'd be happy to answer them.

Thanks!

Cheers,
Chris

On 10/11/07 9:14 AM, "Rick Moynihan" <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I've recently downloaded Nutch v0.9, to experiment in searching blog
> posts and RSS/Atom feeds. So far I have managed to get it to
> successfully crawl, index and search some websites.
>
> I am now starting my investigations to use Nutch to crawl/index/search
> news/blog feeds, and have included the parse-rss plugin, which appears
> to ship in the plugins/ directory, by pasting the following into my
> nutch-site.xml file:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
>
> However some feeds appear to return the following error (apparently
> because they are being returned with a mime-type of application/xml):
>
> Error parsing: http://example.com/.rss: failed(2,200):
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/xml url=http://example.com/.rss
>
> It also appears when searching that the returned results point to the
> matching feed rather than the matching item. Is there a way around
> this? Or am I best parsing out the item urls (e.g. via a shell script),
> somehow adding them to the crawl list and indexing the HTML as normal?
>
> Also, if anyone is using Nutch to index blogs/feeds, then I'd be
> interested in how you have it configured.
>
> Thanks again,

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
______________________________________________
Jet Propulsion Laboratory
Pasadena, CA
Office: 171-266B
Mailstop: 171-246
______________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
