Re: Indexing Feeds & Blog Posts with Nutch

Chris Mattmann Thu, 11 Oct 2007 15:24:49 -0700

Hi Rick,

 Glad to hear that you're interested in using Nutch!

 There are currently 2 plugins that parse feeds and get them indexed:

 parse-rss - older, but gets the job done
 feed - newer, and takes advantage of the ability to parse/index feeds in
one step, rather than in many

 There are other idiosyncrasies about each of these plugins so feel free to
ask specific questions to the main developers of each of them. The parse-rss
plugin was primarily developed by me, and the feed plugin was primarily
developed by Doğacan Güney, another Nutch committer like myself.

 As for the error that you're getting below, it occurs because Nutch can't
reliably differentiate between the MIME types of different kinds of XML
content. So, to Nutch, even though it's a .rss file, its MIME type is
application/xml. That is a valid MIME type for the file, but it is not the
preferred one (application/rss+xml, or the like), so Nutch has trouble
finding the appropriate parser for the content. For instance, according to
parse-plugins.xml (a file in your $NUTCH_HOME/conf directory), the parse-rss
plugin and the feed plugin are registered to parse application/rss+xml, but
not application/xml.
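
 To make the lookup problem concrete, here is a rough Python sketch of a
parser table keyed by exact MIME type. It is only a toy model, not Nutch's
actual code: the names REGISTERED_PARSERS, find_parsers and describe are
made up for illustration, and only the application/rss+xml row is taken
from the description above (the text/html and text/plain rows are assumed).

```python
# Toy model of a MIME-type-to-parser lookup; NOT Nutch's actual code.
# Only the application/rss+xml row comes from the message above;
# the other rows and all names here are hypothetical.
REGISTERED_PARSERS = {
    "application/rss+xml": ["parse-rss", "feed"],
    "text/html": ["parse-html"],
    "text/plain": ["parse-text"],
}


def find_parsers(content_type):
    """Return the parser plugins registered for an exact MIME type."""
    try:
        return REGISTERED_PARSERS[content_type]
    except KeyError:
        # No entry for this exact type, so no parser can be chosen.
        raise LookupError(
            f"parser not found for contentType={content_type}"
        ) from None


def describe(content_type):
    """Return the registered parsers as text, or the lookup error message."""
    try:
        return ", ".join(find_parsers(content_type))
    except LookupError as err:
        return str(err)


if __name__ == "__main__":
    print(describe("application/rss+xml"))
    print(describe("application/xml"))
```

 In this sketch a feed detected as application/rss+xml resolves to both
plugins, while the same feed detected as application/xml finds no entry at
all, which is the shape of the error you are seeing.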

The current trunk version of Nutch recently had a fix committed for this
very issue (http://issues.apache.org/jira/browse/NUTCH-562).

 If you have any more specific questions, I'd be happy to answer them.

Thanks!

Cheers,
  Chris

On 10/11/07 9:14 AM, "Rick Moynihan" <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> I've recently downloaded Nutch v0.9, to experiment in searching blog
> posts and RSS/Atom feeds.  So far I have managed to get it to
> successfully crawl, index and search some websites.
> 
> I am now starting my investigations into using Nutch to crawl/index/search
> news/blog feeds.  I have included the parse-rss plugin, which appears
> to ship in the plugins/ directory, by pasting the following into my
> nutch-site.xml file:
> 
> <property>
>    <name>plugin.includes</name>
>    <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> 
> However, some feeds appear to return the following error (apparently
> because they are being returned with a MIME type of application/xml):
> 
> Error parsing: http://example.com/.rss: failed(2,200):
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/xml url=http://example.com/.rss
> 
> It also appears when searching that the returned results point to the
> matching feed rather than the matching item.  Is there a way around
> this?  Or am I better off parsing out the item URLs (e.g. via a shell script),
> somehow adding them to the crawl list, and indexing the HTML as normal?
> 
> Also, if anyone is using Nutch to index blogs/feeds, then I'd be
> interested in how you have it configured.
> 
> Thanks again,

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
