Hi Brian, Sorry for taking so long to reply. Here ya go:
> Do you have any URLs for feeds that are reliably parsed and indexed by
> the feed parser?

I haven't tested/used this plugin in quite a while. There was someone on
the nutch-user list before, nutch.newbie, who was doing quite a bit of
feed parsing. Nutch.newbie, if you're still around, could you send Brian
a list of feeds that you were testing on?

> Does it actually index atom at present? There's
> something in the code that looks for application/rss+xml as the mime
> type.

AFAIK, the plugin does in fact index atom. The plugin itself is built on
top of the underlying ROME toolkit if I remember correctly.

HTH,
Chris

>
> Brian Ulicny
>
>
> On Thu, 11 Oct 2007 15:23:04 -0700, "Chris Mattmann"
> <[EMAIL PROTECTED]> said:
>> Hi Rick,
>>
>> Glad to hear that you're interested in using Nutch!
>>
>> There are currently 2 plugins that parse feeds and get them indexed:
>>
>> parse-rss - older, but gets the job done
>> feed - newer, and takes advantage of the ability to parse/index feeds in
>> one step, rather than in many
>>
>> There are other idiosyncrasies about each of these plugins so feel free
>> to ask specific questions to the main developers of each of them. The
>> parse-rss plugin was primarily developed by me, and the feed plugin was
>> primarily developed by Doğacan Güney, another Nutch committer like myself.
>>
>> As for the error that you're getting below, it's due to the fact that
>> Nutch can't reliably differentiate between the mime type of different XML
>> content. So, to Nutch, even though it's a .rss file, its mime type is
>> application/xml. Because the mime type, though a true mime type of the
>> file, is not the preferred mime type (application/rss+xml, or the like),
>> Nutch has trouble finding the appropriate parser to parse the content.
>> For instance, according to parse-plugins.xml (a file in your
>> $NUTCH_HOME/conf directory), the parse-rss plugin and the feed plugin
>> are registered to parse application/rss+xml, but not application/xml.
>>
>> The current trunk version of Nutch recently had a fix committed for this
>> very issue (http://issues.apache.org/jira/browse/NUTCH-562).
>>
>> If you have any more specific questions, I'd be happy to answer them.
>>
>> Thanks!
>>
>> Cheers,
>> Chris
>>
>>
>>
>> On 10/11/07 9:14 AM, "Rick Moynihan" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> I've recently downloaded Nutch v0.9, to experiment in searching blog
>>> posts and RSS/Atom feeds. So far I have managed to get it to
>>> successfully crawl, index and search some websites.
>>>
>>> I am now starting my investigations to use Nutch to crawl/index/search
>>> news/blog feeds. And have included the parse-rss plugin which appears
>>> to ship in the plugins/ directory by pasting the following into my
>>> nutch-site.xml file:
>>>
>>> <property>
>>> <name>plugin.includes</name>
>>> <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>>
>>> However some feeds appear to return the following error (apparently
>>> because they are being returned with a mime-type of application/xml):
>>>
>>> Error parsing: http://example.com/.rss: failed(2,200):
>>> org.apache.nutch.parse.ParseException: parser not found for
>>> contentType=application/xml url=http://example.com/.rss
>>>
>>> It also appears when searching that the returned results point to the
>>> matching feed rather than the matching item. Is there a way around
>>> this? Or am I best parsing out the item urls (e.g. via a shell script)
>>> somehow adding them to the crawlist and indexing the HTML as normal?
>>>
>>> Also, if anyone is using Nutch to index blogs/feeds, then I'd be
>>> interested in how you have it configured.
>>>
>>> Thanks again,
>>
>> ______________________________________________
>> Chris Mattmann, Ph.D.
>> [EMAIL PROTECTED]
>> Cognizant Development Engineer
>> Early Detection Research Network Project
>>
>> _________________________________________________
>> Jet Propulsion Laboratory Pasadena, CA
>> Office: 171-266B Mailstop: 171-246
>> _______________________________________________________
>>
>> Disclaimer: The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>>
>>

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
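
P.S. For anyone still on v0.9 and hitting the "parser not found for
contentType=application/xml" error quoted above: one possible workaround,
offered only as an untested sketch and not as the fix committed in
NUTCH-562, is to register a parser for application/xml yourself in
$NUTCH_HOME/conf/parse-plugins.xml. The entry below is an assumption about
what such a mapping could look like; it reuses the parse-rss plugin id
from the plugin.includes value in the thread.

```xml
<!-- Hypothetical entry for $NUTCH_HOME/conf/parse-plugins.xml (untested sketch). -->
<!-- Assumption: the feeds you crawl are served as application/xml and
     should be handed to the parse-rss plugin. Place this inside the
     existing <parse-plugins> root element, next to the other mimeType
     entries. -->
<mimeType name="application/xml">
  <plugin id="parse-rss" />
</mimeType>
```

Note the trade-off: a mapping like this would route every application/xml
document to the RSS parser, including XML that is not a feed, so it only
makes sense for a crawl that is restricted to feed URLs. The plugin's own
plugin.xml may also need to claim that content type.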
