Re: Feed Plugin Crawl Links

Julien Nioche Thu, 08 Aug 2013 07:07:45 -0700

Hi Rich,

What you need is doable with a bit of hacking:


> My requirement is to have a single index entry for each feed item.  My
> choices are:
>
> 1)  Use the "parse-tika" plugin, which crawls the links that each RSS item
> contains, but collects none of the RSS item metadata (author, published
> date, geodata).
>

I haven't checked that it works, but you could write an HtmlParseFilter and
see whether you can pull the metadata from the feed page (assuming that Tika
makes it available to you), then add it to the outlinks. See the urlmeta
plugin for how to make sure that the metadata is carried through to the
linked page.

This requires a modification of Nutch so that metadata can be added to a
newly created outlink. I meant to open a JIRA and attach a patch for it and
will do it shortly.

When fetching the outlinks, you'd get the 'normal' content as well as the
metadata you got from the feed page.
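
To make the carry-through idea concrete, here is a rough sketch in plain
Java. It is a toy model under stated assumptions, not Nutch code: the class
OutlinkMetadataSketch, its methods and the field names are all hypothetical,
and the rule that page fields win over feed metadata is just one possible
choice.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: none of these types or methods are Nutch classes. */
class OutlinkMetadataSketch {

    /** Metadata remembered per outlink URL while the feed page is parsed. */
    private final Map<String, Map<String, String>> metadataByUrl = new HashMap<>();

    /** Feed-page side: attach an item's metadata to the link it points to. */
    void addOutlink(String url, Map<String, String> itemMetadata) {
        metadataByUrl.put(url, new LinkedHashMap<>(itemMetadata));
    }

    /**
     * Linked-page side: start from the fields parsed out of the fetched page,
     * then add the metadata carried over from the feed.
     */
    Map<String, String> buildDocument(String url, Map<String, String> pageFields) {
        Map<String, String> doc = new LinkedHashMap<>(pageFields);
        Map<String, String> carried = metadataByUrl.get(url);
        if (carried != null) {
            // Assumption: a field parsed from the page wins over the feed's value.
            carried.forEach(doc::putIfAbsent);
        }
        return doc;
    }

    public static void main(String[] args) {
        OutlinkMetadataSketch sketch = new OutlinkMetadataSketch();

        Map<String, String> itemMetadata = new LinkedHashMap<>();
        itemMetadata.put("author", "Jane Doe");
        itemMetadata.put("published", "2013-08-07");
        sketch.addOutlink("http://example.com/story", itemMetadata);

        Map<String, String> pageFields = new LinkedHashMap<>();
        pageFields.put("title", "Story");
        pageFields.put("content", "Full text of the linked page");

        System.out.println(sketch.buildDocument("http://example.com/story", pageFields));
    }
}
```

In a real crawl the per-URL map would not live in memory; the metadata would
have to travel with the outlink through the crawl data, which is what the
Nutch modification mentioned above is for.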


>
> 2)  Use the "feed" plugin, which recognizes each RSS item as such and
> collects the metadata I am interested in, but doesn't include the content
> of the linked page in the index.
>

Alternatively, you could modify the feed plugin so that it creates an
outlink for each RSS item and passes it the metadata of interest. However,
make sure you don't create parse objects for the items, as the content would
be fetched later.
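
As a rough illustration of that second route, the sketch below turns feed
items into outlinks that carry metadata, and produces nothing else per item.
It is plain Java under stated assumptions: FeedItem, MetaOutlink and
toOutlinks are hypothetical names for this example, not classes or methods
of the feed plugin or of Nutch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: FeedItem and MetaOutlink are hypothetical, not Nutch classes. */
class FeedItemOutlinkSketch {

    /** The parts of an RSS item this sketch cares about. */
    static final class FeedItem {
        final String link;
        final String title;
        final String author;
        final String published;

        FeedItem(String link, String title, String author, String published) {
            this.link = link;
            this.title = title;
            this.author = author;
            this.published = published;
        }
    }

    /** An outlink that can carry metadata along to the page it points at. */
    static final class MetaOutlink {
        final String toUrl;
        final String anchor;
        final Map<String, String> metadata;

        MetaOutlink(String toUrl, String anchor, Map<String, String> metadata) {
            this.toUrl = toUrl;
            this.anchor = anchor;
            this.metadata = metadata;
        }
    }

    /**
     * One outlink per item that has a link. No per-item document is built
     * here, because the linked page is fetched and parsed later.
     */
    static List<MetaOutlink> toOutlinks(List<FeedItem> items) {
        List<MetaOutlink> outlinks = new ArrayList<>();
        for (FeedItem item : items) {
            if (item.link == null || item.link.isEmpty()) {
                continue; // nothing to fetch later, so nothing to emit
            }
            Map<String, String> metadata = new LinkedHashMap<>();
            if (item.author != null) {
                metadata.put("author", item.author);
            }
            if (item.published != null) {
                metadata.put("published", item.published);
            }
            outlinks.add(new MetaOutlink(item.link, item.title, metadata));
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<FeedItem> items = new ArrayList<>();
        items.add(new FeedItem("http://example.com/story", "Story", "Jane Doe", "2013-08-07"));
        for (MetaOutlink outlink : toOutlinks(items)) {
            System.out.println(outlink.toUrl + " " + outlink.metadata);
        }
    }
}
```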

Whether you want to go the 'tika' or 'feed' way depends on whether you can
get the metadata from Tika.


>
> Right now I am going totally Rambo on this and I am going to try to use
> the protocol factory to create an Http protocol *within* the parser to get
> the content of the linked page so I can add it to the Nutch document for
> indexing.  I'm sure this violates some principle of division of labor
> within Nutch, but . . .
>

Ouch  ;-)

<shameless_plug> metadata handling will be covered in detail on day 2 of
the forthcoming Nutch course I am organising in the UK in October
</shameless_plug>

Julien

PS: will open a JIRA for specifying metadata in outlinks



>
> Rich
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Wednesday, August 07, 2013 6:03 PM
> To: [email protected]
> Subject: Re: Feed Plugin Crawl Links
>
> Hi Richard,
>
> if I understood right, parse-tika does the job well?
> Extract content + all links including anchor texts?
>
> 1. The plugin "parse-tika" seems indeed better maintained than "feed".
>
> 2. plugin feed is special as it treats an RSS file as
>  - one master document (the rss file)
>  - many sub-documents (the links, each with title, description)
>    but these are "stored" in ParseResult by the link URL.
>    Try parsechecker: it will show one ParseResult for each
>    RSS item.
>
> The concept of sub-documents might be useful (e.g., also for zip files)
> but has the disadvantage that sub-document URLs are either artificial or
> duplicate real URLs. And to be honest, I don't know whether sub-documents
> "work" in recent Nutch versions.
>
> > Using this configuration the feed parser plugin is NEVER invoked, but my
> > links are crawled.
> > Switching the order results in only the title and description of the
> > feed item being indexed, but the link is not crawled (e.g., perhaps the
> > parse-tika plugin is not called?).
>
> Why not simply disable plugin "feed" by removing it from property
> "plugin.includes"?
>
> Sebastian
>
> On 08/07/2013 06:55 PM, Richard Bergmann wrote:
> > NUTCH 1.7
> >
> > I am using the feed parse plugin to consume and index an RSS feed.  While
> > content (title and description) of each feed item is indexed, what I would
> > *really* like is to crawl and index the content of the page that the item
> > links to.
> >
> > Is this something that is supposed to happen but is not for some reason
> > (i.e., I have it configured improperly)?  Or is it not designed to crawl
> > the link?  If the latter, is there some way to *make* it crawl that link?
> >
> > FYI, the parse-plugins.xml file has (for relevant RSS entries):
> >
> > <mimeType name="application/xml">
> >   <plugin id="parse-tika" />
> >   <plugin id="feed" />
> > </mimeType>
> >
> > Using this configuration the feed parser plugin is NEVER invoked, but my
> > links are crawled.  Switching the order results in only the title and
> > description of the feed item being indexed, but the link is not crawled
> > (e.g., perhaps the parse-tika plugin is not called?).
> >
> > Rich Bergmann
> >
>
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
