Hi Stefan,
Thank you for your suggestion. If we filter pages in the index step, it will still cost storage for the trash pages. Anyway, for an intranet crawl that is probably tolerable. A past thread mentioned that we can use only the fetcher of Nutch for some tasks. So is it possible to run the fetcher iteratively until we find the required links, and then store and index only those?

You wrote:

The way I go is that I index such pages anyway but 'tag' them.
So I use an indexing filter for that and tag the positive pages with another tag, like category:trash or category:nugget.
Then I also use a query-filter plugin, and in the UI I extend my query:

queryString + " category:nugget"

So you will have only non-trash pages in your results.
I guess you can also use the prune tool to remove such trash pages from the index if you like.

HTH
Stefan

Am 14.02.2006 um 08:11 schrieb Elwin:

2006/2/14, Elwin <[EMAIL PROTECTED]>:
>
> When using nutch to crawl some sites, I want to index fetched contents
> selectively, only when the urls to these contents fit my filter; for other
> urls I just want nutch to crawl and parse them without indexing.
> How can I achieve this? Which extension point should I extend?
>
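
The indexing filter Stefan describes might look roughly like the sketch below. This is only an illustration: the filter() signature shown is assumed from the Nutch 0.7-era IndexingFilter extension point and may differ in your version, and the "/articles/" pattern, the "url" field lookup, and the "nugget"/"trash" values are example choices, not anything from Stefan's plugin.

// CategoryIndexingFilter.java -- rough sketch only; check the interface
// against your Nutch version before using it.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class CategoryIndexingFilter implements IndexingFilter {

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
      throws IndexingException {
    // Assumption: the basic indexing filter has already stored the page URL
    // in the "url" field of the Lucene document.
    String url = doc.get("url");
    boolean wanted = url != null && url.indexOf("/articles/") >= 0;

    // Tag every page; the query side later keeps only category:nugget.
    // Field.Keyword = stored, indexed, not tokenized (Lucene 1.4-style API;
    // newer Lucene versions use the Field constructor instead).
    doc.add(Field.Keyword("category", wanted ? "nugget" : "trash"));
    return doc;
  }
}

The plugin would still need the usual plugin.xml registering it at the indexing-filter extension point, plus the query-filter counterpart, and the UI then appends " category:nugget" to the user's query as Stefan describes.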
