Chris Mattmann wrote:
 There are currently 2 plugins that parse feeds and get them indexed:

 parse-rss - older, but gets the job done
 feed - newer, and takes advantage of the ability to parse/index feeds in
one step, rather than in many

I didn't realise this, as I was using 0.9, where only parse-rss is included. I'm now following svn trunk, and have crawled and indexed some feeds.

 There are other idiosyncrasies about each of these plugins so feel free to
ask specific questions to the main developers of each of them. The parse-rss
plugin was primarily developed by me, and the feed plugin was primarily
developed by Doğacan Güney, another Nutch committer like myself.

I've tried each plugin individually with Nutch on the same crawl list. Though this is obviously a subjective measure, I've not noticed much difference between them, and both fail on the following FeedBurner feed:

- http://feeds.feedburner.com/Techcrunch

With errors:

failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 411: XML document structures must start and end within the same entity.

and:

failed(2,0): Can't be handled as rss document. org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 411: XML document structures must start and end within the same entity.

I appreciate that the XML for these feeds is likely invalid.
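
A minimal sketch of how the well-formedness of a fetched feed could be checked outside Nutch, using only the JDK's built-in XML parser. The class name FeedWellFormednessCheck, the method firstError, and the sample strings are hypothetical and are not part of Nutch or either plugin; the sketch assumes the feed body has already been saved into a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Nutch: reports the first XML
// well-formedness error in a feed body, or null if there is none.
public class FeedWellFormednessCheck {

    public static String firstError(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse problems instead of letting the default
            // handler print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return null; // parsed cleanly
        } catch (SAXParseException e) {
            // Line numbers are as reported by the parser.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Assumed sample inputs: one complete document, one cut off mid-stream.
        String complete = "<rss><channel><title>t</title></channel></rss>";
        String truncated = "<rss><channel><item>";
        System.out.println("complete:  " + firstError(complete));
        System.out.println("truncated: " + firstError(truncated));
    }
}
```

If the saved copy of the feed parses cleanly this way but still fails inside Nutch, that would suggest the content is being cut short somewhere between fetch and parse rather than being broken at the source.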

 As for the error that you're getting below, it's due to the fact that Nutch
can't reliably differentiate between the mime type of different XML content.
So, to Nutch, even though it's a .rss file, its mime type is
application/xml. Because the mime type, though a true mime type of the file,
is not the preferred mime type (application/rss+xml, or the like), Nutch has
trouble finding the appropriate parser to parse the content. For instance,
according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory),
the parse-rss plugin and the feed plugin are registered to parse
application/rss+xml, but not application/xml.

The current trunk version of Nutch recently had a fix committed for this
very issue (http://issues.apache.org/jira/browse/NUTCH-562).

Thanks for this; I've just started using svn trunk. After trying it, though, I was getting mime type errors for application/atom+xml. Presumably I can fix this by registering the following in parse-plugins.xml?

        <mimeType name="application/atom+xml">
          <plugin id="parse-rss"/>
          <plugin id="feed"/>
        </mimeType>
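
As a rough way to double-check which plugins a given mime type ends up registered to, the sketch below reads a parse-plugins.xml style document and lists the plugin ids for one mime type. It is not Nutch's own lookup code: the class ParsePluginsLookup and the method pluginsFor are hypothetical, and the parse-plugins root element in the sample is an assumption. Only the mimeType and plugin elements and their name and id attributes come from the fragment above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: lists the plugin ids
// registered for one mime type in a parse-plugins.xml style document.
public class ParsePluginsLookup {

    public static List<String> pluginsFor(String configXml, String mimeType) {
        List<String> ids = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(configXml)));
            NodeList mimeTypes = doc.getElementsByTagName("mimeType");
            for (int i = 0; i < mimeTypes.getLength(); i++) {
                Element entry = (Element) mimeTypes.item(i);
                if (!mimeType.equals(entry.getAttribute("name"))) {
                    continue; // a different mime type's entry
                }
                NodeList plugins = entry.getElementsByTagName("plugin");
                for (int j = 0; j < plugins.getLength(); j++) {
                    ids.add(((Element) plugins.item(j)).getAttribute("id"));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read config", e);
        }
        return ids; // empty when nothing is registered for the mime type
    }

    public static void main(String[] args) {
        // Assumed sample config; the root element name is a guess.
        String config =
                "<parse-plugins>"
              + "<mimeType name=\"application/atom+xml\">"
              + "<plugin id=\"parse-rss\"/><plugin id=\"feed\"/>"
              + "</mimeType>"
              + "</parse-plugins>";
        System.out.println(pluginsFor(config, "application/atom+xml"));
        System.out.println(pluginsFor(config, "application/xml"));
    }
}
```

An empty list for a mime type would correspond to the situation described above, where no parser is registered for the content's reported type.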

Another problem I have just now is that some of the search results link to the XML feeds themselves, rather than to the destinations of their items. Is there any way to ensure that the results point to the content itself rather than the feed? Further to this, the summaries often contain fragments of HTML, URLs, HTML entities and so on. I'm assuming this is a side effect of the feeds being indexed rather than their items; is this correct? What can be done about it?

I'm also getting to the stage where I want to programmatically access Nutch from another Java application/process. Does Nutch support doing this in a client/server manner?


 If you have any more specific questions, I'd be happy to answer them.

Thanks, it's much appreciated.

R.
