Chris Mattmann wrote:
There are currently 2 plugins that parse feeds and get them indexed:
- parse-rss: older, but gets the job done
- feed: newer, and takes advantage of the ability to parse/index feeds in
  one step, rather than in many
I didn't realise this, as I was using 0.9, where only parse-rss is
included. I'm now following svn trunk, and have crawled and indexed
some feeds.
There are other idiosyncrasies about each of these plugins so feel free to
ask specific questions to the main developers of each of them. The parse-rss
plugin was primarily developed by me, and the feed plugin was primarily
developed by Doğacan Güney, another Nutch committer like myself.
I've tried both the plugins individually with Nutch, on the same crawl
list, and though obviously a subjective measure, I've not noticed much
difference, with both plugins failing on the feedburner feed:
- http://feeds.feedburner.com/Techcrunch
With errors:
failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid XML:
Error on line 411: XML document structures must start and end within the
same entity.
and:
failed(2,0): Can't be handled as rss document.
org.apache.commons.feedparser.FeedParserException:
org.jdom.input.JDOMParseException: Error on line 411: XML document
structures must start and end within the same entity.
I appreciate that the XML for these feeds is likely invalid.
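For anyone wanting to confirm that independently of Nutch, here is a minimal
sketch that runs fetched feed bytes through the JDK's own SAX parser and
reports the first fatal error. The class and method names are made up for
illustration, and this is not how either plugin parses feeds; it only checks
well-formedness.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Nutch: checks whether feed content is well-formed XML.
class FeedWellFormedCheck {

    /** Returns null if the content is well-formed XML, otherwise a description of the first error. */
    static String firstXmlError(byte[] content) {
        try {
            // DefaultHandler ignores all content and rethrows fatal errors.
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new ByteArrayInputStream(content), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Includes the line number, as in the plugin errors quoted above.
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (SAXException | ParserConfigurationException | IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        byte[] complete = "<rss><channel><title>t</title></channel></rss>"
                .getBytes(StandardCharsets.UTF_8);
        // Simulates a document that stops before its closing tags.
        byte[] cutShort = "<rss><channel><title>t</title>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("complete:  " + firstXmlError(complete));
        System.out.println("cut short: " + firstXmlError(cutShort));
    }
}
```

Feeding it the saved body of a failing feed should show whether the document
itself is broken or whether it only arrives broken inside the crawl.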
As for the error that you're getting below, it's due to the fact that Nutch
can't reliably differentiate between the mime types of different XML content.
So, to Nutch, even though it's a .rss file, its mime type is
application/xml. Because the mime type, though a true mime type of the file,
is not the preferred mime type (application/rss+xml, or the like), Nutch has
trouble finding the appropriate parser to parse the content. For instance,
according to parse-plugins.xml (a file in your $NUTCH_HOME/conf directory),
the parse-rss plugin and the feed plugin are registered to parse
application/rss+xml, but not application/xml.
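To make the lookup problem concrete, here is a toy model of a mime type to
plugin registry. It is only a sketch: the class, map, and method names are
invented, and it does not reproduce Nutch's actual parser selection code. It
just shows why content detected as application/xml finds no feed parser when
only application/rss+xml is registered.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Nutch code) of the mapping described by parse-plugins.xml.
class MimeLookupSketch {

    // Mime type -> plugin ids registered to parse it.
    static final Map<String, List<String>> REGISTRY = new HashMap<>();

    static {
        // Mirrors the situation described above: only the RSS-specific type is registered.
        REGISTRY.put("application/rss+xml", Arrays.asList("parse-rss", "feed"));
    }

    /** Returns the plugin ids registered for the given mime type, or an empty list. */
    static List<String> pluginsFor(String mimeType) {
        List<String> plugins = REGISTRY.get(mimeType);
        return plugins == null ? Collections.<String>emptyList() : plugins;
    }

    public static void main(String[] args) {
        // The preferred type resolves to both feed parsers.
        System.out.println(pluginsFor("application/rss+xml")); // prints [parse-rss, feed]
        // The generic type that was actually detected resolves to nothing.
        System.out.println(pluginsFor("application/xml"));     // prints []
    }
}
```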
The current trunk version of Nutch recently had a fix committed for this
very issue (http://issues.apache.org/jira/browse/NUTCH-562).
Thanks for this, I've just started using svn trunk. Though after trying
it I was getting mime errors for application/atom+xml. Presumably I can
fix this by registering the following in parse-plugins.xml?
<mimeType name="application/atom+xml">
<plugin id="parse-rss"/>
<plugin id="feed"/>
</mimeType>
Another problem I seem to have just now is that some of the search
results link to their XML feeds, rather than to the destination of their
items. Is there any way to ensure that the results point to the content
itself rather than the feed? Further to this, the summaries often seem
to contain clips of HTML, URLs, HTML entities, etc. I'm assuming
this is a side effect of the feeds being indexed rather than their
items; is this correct? What can be done about it?
I'm also getting to the stage where I want to programmatically access
Nutch from another Java application/process. Does Nutch support doing
this in a client/server manner?
If you have any more specific questions, I'd be happy to answer them.
Thanks, it's much appreciated.
R.