Hi Stefan,
Thank you for your suggestion. If we filter pages in the index step, it will still cost storage for the trash pages. Anyway, for an intranet crawl that is probably tolerable. A past thread mentioned that we can use only the fetcher of Nutch for some tasks. So is it possible to run the fetcher iteratively until we find the required links, and then store and index only those?

You wrote:

The way I go is that I index such pages anyway but 'tag' them.
So I use an indexing filter for that and tag the positive pages with another tag, like category:trash or category:nugget.
Then I also use a query-filter plugin, and in the UI I extend my query:

queryString + " category:nugget"

So you will have only non-trash pages in your results.
I guess you can also use the prune tool to remove such trash pages from the index if you like.

HTH
Stefan

Am 14.02.2006 um 08:11 schrieb Elwin:

2006/2/14, Elwin <[EMAIL PROTECTED]>:
>
> When using nutch to crawl some sites, I want to index fetched contents
> selectively, only when the urls to these contents fit my filter; for other
> urls I just want nutch to crawl and parse them without indexing.
> How can I achieve this? Which extension point should I extend?
>
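
The indexing filter Stefan describes might look roughly like the sketch below. This is only an illustration: the filter() signature shown is assumed from the Nutch 0.7-era IndexingFilter extension point and may differ in your version, and the "/articles/" pattern, the "url" field lookup, and the "nugget"/"trash" values are example choices, not anything from Stefan's plugin.

// CategoryIndexingFilter.java -- rough sketch only; check the interface
// against your Nutch version before using it.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class CategoryIndexingFilter implements IndexingFilter {

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
      throws IndexingException {
    // Assumption: the basic indexing filter has already stored the page URL
    // in the "url" field of the Lucene document.
    String url = doc.get("url");
    boolean wanted = url != null && url.indexOf("/articles/") >= 0;

    // Tag every page; the query side later keeps only category:nugget.
    // Field.Keyword = stored, indexed, not tokenized (Lucene 1.4-style API;
    // newer Lucene versions use the Field constructor instead).
    doc.add(Field.Keyword("category", wanted ? "nugget" : "trash"));
    return doc;
  }
}

The plugin would still need the usual plugin.xml registering it at the indexing-filter extension point, plus the query-filter counterpart, and the UI then appends " category:nugget" to the user's query as Stefan describes.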
