Re: Indexing Feeds & Blog Posts with Nutch

Rick Moynihan Fri, 12 Oct 2007 09:12:10 -0700

Chris Mattmann wrote:

 There are currently 2 plugins that parse feeds and get them indexed:


 parse-rss - older, but gets the job done
 feed - newer, and takes advantage of the ability to parse/index feeds in
one step, rather than in many

I didn't realise this as I was using 0.9, where only parse-rss is included. I'm now following svn trunk, and have crawled and indexed some feeds.

 There are other idiosyncrasies about each of these plugins so feel free to
ask specific questions to the main developers of each of them. The parse-rss
plugin was primarily developed by me, and the feed plugin was primarily
developed by Doğacan Güney, another Nutch committer like myself.

I've tried both plugins individually with Nutch, on the same crawl list, and though obviously a subjective measure, I've not noticed much difference, with both plugins failing on the feedburner feed:


- http://feeds.feedburner.com/Techcrunch

With errors:

failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 411: XML document structures must start and end within the same entity.


and:

failed(2,0): Can't be handled as rss document. org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 411: XML document structures must start and end within the same entity.


I appreciate that the XML for these feeds is likely invalid.
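
One way to confirm that independently of Nutch is to run the fetched feed through a plain XML parser and see where it first breaks. The following is only a rough sketch in Python using the standard library; the function name first_xml_error is made up for illustration and it has nothing to do with how either plugin parses feeds.

```python
import xml.etree.ElementTree as ET


def first_xml_error(document):
    """Return None if `document` is well-formed XML.

    Otherwise return (line, column, message) for the first error the
    parser reports. `first_xml_error` is a hypothetical helper name,
    not part of Nutch or of either feed plugin.
    """
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


if __name__ == "__main__":
    # A feed cut off mid-document, similar in spirit to a truncated fetch.
    truncated = "<rss>\n<channel>\n<item>"
    print(first_xml_error(truncated))
```

If a saved copy of the feed reports an error near the same line as Nutch does, the problem is in the fetched content (invalid or truncated) rather than in the plugins.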

 As for the error that you're getting below, it's due to the fact that Nutch
can't reliably differentiate between the mime type of different XML content.
So, to Nutch, even though it's a .rss file, its mime type is
application/xml. Because the mime type, though a true mime type of the file,
is not the preferred mime type (application/rss+xml, or the like), Nutch has
trouble finding the appropriate parser to parse the content. For instance,
according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory),
the parse-rss plugin and the feed plugin are registered to parse
application/rss+xml, but not application/xml.

The current trunk version of Nutch recently had a fix committed for this
very issue (http://issues.apache.org/jira/browse/NUTCH-562).

Thanks for this, I've just started using svn trunk. Though after trying it I was getting mime errors for application/atom+xml. Presumably I can fix this by registering the following in parse-plugins.xml?


        <mimeType name="application/atom+xml">
          <plugin id="parse-rss"/>
          <plugin id="feed"/>
        </mimeType>
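
To make the intended effect of that mapping concrete, here is a small Python sketch that reads a parse-plugins.xml-style fragment and lists the plugins registered for a given mime type. It is a toy model, not Nutch's own lookup code: the <parse-plugins> wrapper element and the plugins_for helper are assumptions for illustration, and it says nothing about whether either plugin can actually parse Atom.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the mappings discussed in this thread. The
# <parse-plugins> root element is an assumption for illustration.
FRAGMENT = """
<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss"/>
    <plugin id="feed"/>
  </mimeType>
  <mimeType name="application/atom+xml">
    <plugin id="parse-rss"/>
    <plugin id="feed"/>
  </mimeType>
</parse-plugins>
"""


def plugins_for(config_xml, mime_type):
    """Return plugin ids registered for `mime_type`, in document order.

    `plugins_for` is a hypothetical helper; it only mirrors the
    mimeType/plugin structure shown above.
    """
    root = ET.fromstring(config_xml)
    for entry in root.iter("mimeType"):
        if entry.get("name") == mime_type:
            return [plugin.get("id") for plugin in entry.findall("plugin")]
    return []


if __name__ == "__main__":
    print(plugins_for(FRAGMENT, "application/atom+xml"))
    print(plugins_for(FRAGMENT, "application/xml"))
```

In this model a mime type with no entry, such as application/xml, comes back with no plugins at all, which matches the situation described above where no appropriate parser is found.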

Another problem I have just now is that some of the search results link to their XML feeds, rather than to the destination of their items. Is there any way to ensure that the results point to the content itself rather than the feed? Further to this, the summaries often seem to contain clips of HTML, URLs, HTML entities, etc. I'm assuming this is a side effect of the feeds being indexed rather than their items; is this correct? What can be done about it?

I'm also getting to the stage where I want to programmatically access Nutch from another Java application/process. Does Nutch support doing this in a client/server manner?

 If you have any more specific questions, I'd be happy to answer them.


Thanks, it's much appreciated.

R.
