Hi Richard,

if I understand correctly, parse-tika does the job well, i.e., it extracts the content plus all links, including anchor texts?
1. The plugin "parse-tika" indeed seems better maintained than "feed".

2. The plugin "feed" is special in that it treats an RSS file as
   - one master document (the RSS file)
   - many sub-documents (the links, each with title and description)
   but these are "stored" in the ParseResult under the link URL.
   Try parsechecker: it will show one ParseResult for each RSS item.
   The concept of sub-documents might be useful (e.g., also for zip files),
   but it has the disadvantage that sub-document URLs are either artificial
   or duplicate real URLs. And to be honest, I don't know whether
   sub-documents "work" in recent Nutch versions.

> Using this configuration the feed parser plugin is NEVER invoked, but my
> links are crawled.
> Switching the order results in only the title and description of the feed
> item being indexed, but
> the link is not crawled (e.g., perhaps the parse-tika plugin is not called?).

Why not simply disable the plugin "feed" by removing it from the property
"plugin.includes"?

Sebastian

On 08/07/2013 06:55 PM, Richard Bergmann wrote:
> NUTCH 1.7
>
> I am using the feed parse plugin to consume and index an RSS feed. While content
> (title and description) of each feed item is indexed, what I would *really*
> like is to crawl and index the content of the page that the item links to.
>
> Is this something that is supposed to happen but is not for some reason
> (i.e., I have it configured improperly)? Or is it not designed to crawl the
> link? If the latter, is there some way to *make* it crawl that link?
>
> FYI, the parse-plugins.xml file has (for relevant RSS entries):
>
> <mimeType name="application/xml">
>   <plugin id="parse-tika" />
>   <plugin id="feed" />
> </mimeType>
>
> Using this configuration the feed parser plugin is NEVER invoked, but my
> links are crawled. Switching the order results in only the title and
> description of the feed item being indexed, but the link is not crawled
> (e.g., perhaps the parse-tika plugin is not called?).
>
> Rich Bergmann
>
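
P.S. A rough sketch of what removing "feed" from "plugin.includes" could look
like as an override in conf/nutch-site.xml. The plugin list in the value is
only an assumed example and not taken from your setup: start from whatever
value you currently have and just drop the "feed" entry from it, leaving
parse-tika in place.

    <!-- conf/nutch-site.xml: illustrative override, adapt to your own value -->
    <property>
      <name>plugin.includes</name>
      <!-- Assumed example list; the only point is that "feed" is absent
           and "parse-tika" is still included. -->
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>

With "feed" no longer loaded, its entry under the application/xml mimeType in
parse-plugins.xml should have no effect, so parse-tika would be the parser
that handles the RSS file.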

