Re: Feed Plugin Crawl Links

Sebastian Nagel Wed, 07 Aug 2013 15:33:35 -0700

Hi Richard,

If I understood correctly, parse-tika does the job well, i.e., it
extracts the content plus all links, including anchor texts?


1. The plugin "parse-tika" seems indeed better maintained
than "feed".

2. The plugin "feed" is special in that it treats an RSS file as
 - one master document (the rss file)
 - many sub-documents (the links, each with title, description)
   but these are "stored" in ParseResult by the link URL.
   Try parsechecker: it will show one ParseResult for each
   RSS item.

The concept of sub-documents might be useful (e.g., also for zip files)
but has the disadvantage that sub-document URLs are either artificial
or duplicate real URLs. And to be honest, I don't know whether sub-documents
"work" in recent Nutch versions.

> Using this configuration the feed parser plugin is NEVER invoked, but my 
> links are crawled.  Switching the order results in only the title and 
> description of the feed item being indexed, but the link is not crawled 
> (e.g., perhaps the parse-tika plugin is not called?).

Why not simply disable plugin "feed" by removing it from property 
"plugin.includes"?
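
As a rough sketch only: the override would go into conf/nutch-site.xml.
The plugin list below is an assumption (roughly what I would expect a
stock Nutch 1.7 setup to use), so keep whatever list you already have and
just make sure "feed" does not appear in it. parse-tika then handles the
RSS file and its links end up as ordinary outlinks.

```xml
<!-- conf/nutch-site.xml: illustrative value, adapt to your own plugin list -->
<property>
  <name>plugin.includes</name>
  <!-- note: no "feed" entry here, so only parse-tika can claim RSS/XML -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With "feed" gone from plugin.includes, the order of the entries in
parse-plugins.xml should no longer matter for this mime type.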

Sebastian

On 08/07/2013 06:55 PM, Richard Bergmann wrote:
> NUTCH 1.7
> 
> I am using the feed parse plugin to consume and index an RSS feed.  While content 
> (title and description) of each feed item is indexed, what I would *really* 
> like is to crawl and index the content of the page that the item links to.
> 
> Is this something that is supposed to happen but is not for some reason 
> (i.e., I have it configured improperly)?  Or is it not designed to crawl the 
> link?  If the latter, is there some way to *make* it crawl that link?
> 
> FYI, the parse-plugins.xml file has (for relevant RSS entries):
> 
> <mimeType name="application/xml">
>   <plugin id="parse-tika" />
>   <plugin id="feed" />
> </mimeType>
> 
> Using this configuration the feed parser plugin is NEVER invoked, but my 
> links are crawled.  Switching the order results in only the title and 
> description of the feed item being indexed, but the link is not crawled 
> (e.g., perhaps the parse-tika plugin is not called?).
> 
> Rich Bergmann
>
