Well, "does the job well" is a bit of an overstatement. :-) My requirement is to have a single index entry for each feed item. My choices are:

1) Use the "parse-tika" plugin, which crawls the links that each RSS item contains, but collects none of the RSS item metadata (author, published date, geodata).

2) Use the "feed" plugin, which recognizes each RSS item as such and collects the metadata I am interested in, but doesn't include the content of the linked page in the index.

Right now I am going totally Rambo on this and I am going to try to use the protocol factory to create an Http protocol *within* the parser to get the content of the linked page so I can add it to the Nutch document for indexing. I'm sure this violates some principle of division of labor within Nutch, but . . .

Rich

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]]
Sent: Wednesday, August 07, 2013 6:03 PM
To: [email protected]
Subject: Re: Feed Plugin Crawl Links

Hi Richard,

if understood right parse-tika does the job well?
Extract content + all links including anchor texts?

1. The plugin "parse-tika" seems indeed better maintained than "feed".

2. plugin feed is special as it treats a rss file as
   - one master document (the rss file)
   - many sub-documents (the links, each with title, description)
   but these are "stored" in ParseResult by the link URL.
   Try parsechecker: it will show one ParseResult for each RSS item.
   The concept of sub-documents might be useful (e.g., also for zip files)
   but has the disadvantage that sub-document URLs are either artificial
   or duplicate real URLs. And to be honest, I don't know whether
   sub-documents "work" in recent Nutch versions.

> Using this configuration the feed parser plugin is NEVER invoked, but my
> links are crawled.
> Switching the order results in only the title and description of the
> feed item being indexed, but the link is not crawled (e.g., perhaps the
> parse-tika plugin is not called?).

Why not simply disable plugin "feed" by removing it from property "plugin.includes"?

Sebastian

On 08/07/2013 06:55 PM, Richard Bergmann wrote:
> NUTCH 1.7
>
> I am using the feed parse plugin to consume and index an RSS feed. While content
> (title and description) of each feed item is indexed, what I would *really*
> like is to crawl and index the content of the page that the item links to.
>
> Is this something that is supposed to happen but is not for some reason
> (i.e., I have it configured improperly)? Or is it not designed to crawl the
> link? If the latter, is there some way to *make* it crawl that link?
>
> FYI, the parse-plugins.xml file has (for relevant RSS entries):
>
> <mimeType name="application/xml">
>   <plugin id="parse-tika" />
>   <plugin id="feed" />
> </mimeType>
>
> Using this configuration the feed parser plugin is NEVER invoked, but my
> links are crawled. Switching the order results in only the title and
> description of the feed item being indexed, but the link is not crawled
> (e.g., perhaps the parse-tika plugin is not called?).
>
> Rich Bergmann
>
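
For reference, Sebastian's suggestion of removing "feed" from the "plugin.includes" property might look roughly like the override below, placed inside the <configuration> element of conf/nutch-site.xml. This is only a sketch: the plugin list shown is an assumption modeled on a typical Nutch 1.x default and is not taken from this thread, so start from whatever your own configuration already lists. The only point being illustrated is that "feed" does not appear in the value, which leaves parse-tika to handle the RSS file.

```xml
<!-- conf/nutch-site.xml: overrides the value from nutch-default.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed plugin list, for illustration only. Note that "feed" is absent. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

The trade-off Rich describes still applies with this setting: the linked pages get crawled, but the per-item RSS metadata (author, published date, geodata) is not collected.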

