RE: Feed Plugin Crawl Links

Richard Bergmann Thu, 08 Aug 2013 05:23:13 -0700

Well, "does the job well" is a bit of an overstatement.  :-)

My requirement is to have a single index entry for each feed item.  My choices 
are:

1)  Use the "parse-tika" plugin, which crawls the links that each RSS item 
contains, but collects none of the RSS item metadata (author, published date, 
geodata).

2)  Use the "feed" plugin, which recognizes each RSS item as such and collects 
the metadata I am interested in, but doesn't include the content of the linked 
page in the index.

Right now I am going totally Rambo on this and I am going to try to use the 
protocol factory to create an Http protocol *within* the parser to get the 
content of the linked page so I can add it to the Nutch document for indexing.  
I'm sure this violates some principle of division of labor within Nutch, but . . .

Rich

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Wednesday, August 07, 2013 6:03 PM
To: [email protected]
Subject: Re: Feed Plugin Crawl Links

Hi Richard,

If I understood correctly, parse-tika does the job well?
It extracts the content plus all links, including anchor texts?

1. The plugin "parse-tika" seems indeed better maintained than "feed".

2. The "feed" plugin is special in that it treats an RSS file as
 - one master document (the rss file)
 - many sub-documents (the links, each with title, description)
   but these are "stored" in ParseResult by the link URL.
   Try parsechecker: it will show one ParseResult for each
   RSS item.

The concept of sub-documents might be useful (e.g., also for zip files) but has 
the disadvantage that sub-document URLs are either artificial or duplicate real 
URLs. And to be honest, I don't know whether sub-documents "work" in recent 
Nutch versions.

> Using this configuration the feed parser plugin is NEVER invoked, but my 
> links are crawled.
> Switching the order results in only the title and description of the 
> feed item being indexed, but the link is not crawled (e.g., perhaps the 
> parse-tika plugin is not called?).

Why not simply disable plugin "feed" by removing it from property 
"plugin.includes"?
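
For illustration, such an override in conf/nutch-site.xml might look like the 
sketch below. The plugin list shown is an assumption based on a typical 
Nutch 1.x default and is not taken from this thread; the only point is that 
"feed" does not appear in the value, so the plugin is never loaded.

```xml
<!-- Sketch only: the plugin list is assumed, not copied from a real setup.
     Keep the value your installation already uses, minus "feed". -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```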

Sebastian

On 08/07/2013 06:55 PM, Richard Bergmann wrote:
> NUTCH 1.7
> 
> I am using the feed parse plugin to consume and index an RSS feed.  While content 
> (title and description) of each feed item is indexed, what I would *really* 
> like is to crawl and index the content of the page that the item links to.
> 
> Is this something that is supposed to happen but is not for some reason 
> (i.e., I have it configured improperly)?  Or is it not designed to crawl the 
> link?  If the latter, is there some way to *make* it crawl that link?
> 
> FYI, the parse-plugins.xml file has (for relevant RSS entries):
> 
> <mimeType name="application/xml">
>   <plugin id="parse-tika" />
>   <plugin id="feed" />
> </mimeType>
> 
> Using this configuration the feed parser plugin is NEVER invoked, but my 
> links are crawled.  Switching the order results in only the title and 
> description of the feed item being indexed, but the link is not crawled 
> (e.g., perhaps the parse-tika plugin is not called?).
> 
> Rich Bergmann
>
