Re: Indexing Feeds & Blog Posts with Nutch

Chris Mattmann Mon, 15 Oct 2007 08:07:27 -0700

Hi Brian,

 Sorry for taking so long to reply. Here ya go:


> Do you have any URLs for feeds that are reliably parsed and indexed by
> the feed parser? 

 I haven't tested/used this plugin in quite a while. There was someone on
the nutch-user list before, nutch.newbie, that was doing quite a bit of feed
parsing. Nutch.newbie, if you're still around, could you send Brian a list
of feeds that you were testing on?

> Does it actually index atom at present?  There's
> something in the code that looks for application/rss+xml as the mime
> type.

 AFAIK, the plugin does in fact index atom. The plugin itself is built on
top of the underlying ROME toolkit if I remember correctly.

HTH,
  Chris

> 
> Brian Ulicny
> 
> 
> On Thu, 11 Oct 2007 15:23:04 -0700, "Chris Mattmann"
> <[EMAIL PROTECTED]> said:
>> Hi Rick,
>> 
>>  Glad to hear that you're interested in using Nutch!
>> 
>>  There are currently 2 plugins that parse feeds and get them indexed:
>> 
>>  parse-rss - older, but gets the job done
>>  feed - newer, and takes advantage of the ability to parse/index feeds in
>> one step, rather than in many
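>> 
>>  As a rough sketch (not taken from a tested setup), switching to the feed
>> plugin would mean changing plugin.includes in nutch-site.xml along these
>> lines. The value is adapted from the one in the original question; that
>> "feed" is the right id to list, and that parse-rss can simply be dropped,
>> are assumptions to check against the plugins/ directory of your install:
>> 
>> ```xml
>> <!-- Sketch only: adapted from the plugin.includes value in the question. -->
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(text|html)|feed|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>> ```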
>> 
>>  There are other idiosyncrasies about each of these plugins, so feel free
>> to ask specific questions to the main developers of each of them. The
>> parse-rss plugin was primarily developed by me, and the feed plugin was
>> primarily developed by Doğacan Güney, another Nutch committer like myself.
>> 
>>  As for the error that you're getting below, it's due to the fact that
>> Nutch can't reliably differentiate between the mime types of different XML
>> content. So, to Nutch, even though it's a .rss file, its mime type is
>> application/xml. Because that mime type, though a true mime type of the
>> file, is not the preferred mime type (application/rss+xml, or the like),
>> Nutch has trouble finding the appropriate parser to parse the content. For
>> instance, according to parse-plugins.xml (a file in your $NUTCH_HOME/conf
>> directory), the parse-rss plugin and the feed plugin are registered to
>> parse application/rss+xml, but not application/xml.
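>> 
>>  To illustrate, the registration in parse-plugins.xml looks roughly like
>> the first entry below. The second entry is a hypothetical workaround, not
>> something I have tested: it would route everything served as
>> application/xml to parse-rss, including XML that is not a feed.
>> 
>> ```xml
>> <!-- Style of registration described above. -->
>> <mimeType name="application/rss+xml">
>>   <plugin id="parse-rss" />
>>   <plugin id="feed" />
>> </mimeType>
>> 
>> <!-- Hypothetical addition (assumption): also hand application/xml to parse-rss. -->
>> <mimeType name="application/xml">
>>   <plugin id="parse-rss" />
>> </mimeType>
>> ```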
>> 
>> The current trunk version of Nutch recently had a fix committed for this
>> very issue (http://issues.apache.org/jira/browse/NUTCH-562).
>> 
>>  If you have any more specific questions, I'd be happy to answer them.
>> 
>> Thanks!
>> 
>> Cheers,
>>   Chris
>> 
>> 
>> 
>> On 10/11/07 9:14 AM, "Rick Moynihan" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi all,
>>> 
>>> I've recently downloaded Nutch v0.9, to experiment in searching blog
>>> posts and RSS/Atom feeds.  So far I have managed to get it to
>>> successfully crawl, index and search some websites.
>>> 
>>> I am now starting to investigate using Nutch to crawl/index/search
>>> news/blog feeds. I have included the parse-rss plugin, which appears
>>> to ship in the plugins/ directory, by pasting the following into my
>>> nutch-site.xml file:
>>> 
>>> <property>
>>>    <name>plugin.includes</name>
>>>    <value>protocol-http|urlfilter-regex|parse-(rss|text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>> </property>
>>> 
>>> However, some feeds appear to return the following error (apparently
>>> because they are being returned with a mime-type of application/xml):
>>> 
>>> Error parsing: http://example.com/.rss: failed(2,200):
>>> org.apache.nutch.parse.ParseException: parser not found for
>>> contentType=application/xml url=http://example.com/.rss
>>> 
>>> It also appears when searching that the returned results point to the
>>> matching feed rather than the matching item.  Is there a way around
>>> this?  Or am I best off parsing out the item URLs (e.g. via a shell
>>> script), adding them to the crawl list and indexing the HTML as normal?
>>> 
>>> Also, if anyone is using Nutch to index blogs/feeds, then I'd be
>>> interested in how you have it configured.
>>> 
>>> Thanks again,
>> 
>> ______________________________________________
>> Chris Mattmann, Ph.D.
>> [EMAIL PROTECTED]
>> Cognizant Development Engineer
>> Early Detection Research Network Project
>> 
>> _________________________________________________
>> Jet Propulsion Laboratory            Pasadena, CA
>> Office: 171-266B                     Mailstop:  171-246
>> _______________________________________________________
>> 
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>> 
>> 

