Re: Crawling site, but only indexing certain pages

Magnús Skúlason Wed, 24 Feb 2010 09:31:07 -0800

Hi,

This is actually very easy: just create an indexing plugin, analyse the URL
format, and return null from the indexing plugin for any page you don't want
to index.
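
A minimal sketch of the URL check such a plugin could make, assuming the
item pages always look like www.example.com/items/<digits>.html. The class
name ItemUrlIndexRule, the method shouldIndex and the regular expression are
made up for illustration and are not part of Nutch; in a real plugin this
check would sit inside the indexing filter's filter method, which would
return null whenever shouldIndex is false.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: decides whether a URL should be indexed.
public class ItemUrlIndexRule {

    // Assumed pattern: only numeric item pages such as /items/1234.html are indexed.
    // The scheme is optional so both "www.example.com/..." and "http://www.example.com/..." match.
    private static final Pattern INDEXABLE =
        Pattern.compile("^(?:https?://)?www\\.example\\.com/items/\\d+\\.html$");

    // Returns true if the page should be indexed, false if it should only be crawled.
    public static boolean shouldIndex(String url) {
        return url != null && INDEXABLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/items/1234.html",
            "http://www.example.com/items/browse_by_xyz.html",
            "http://www.example.com/browse/"
        };
        for (String url : urls) {
            System.out.println(url + " -> index: " + shouldIndex(url));
        }
    }
}
```

The browse pages still need to pass the URL filters so they get fetched and
their links followed; this check only affects what ends up in the index.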


best regards,
Magnus

On Wed, Feb 24, 2010 at 6:09 PM, Steven Wichers <ste...@devnet.com> wrote:

> On some of the sites I want to index with Nutch, there are only
> specific types of pages I would like to be searchable. I need a way to
> be able to crawl these sites, but only index pages that match a
> certain regular expression.
>
> ex:
>
> www.example.com/browse/ finds links in the form of
> www.example.com/items/1234.html and
> www.example.com/items/browse_by_xyz.html . I need to be able to index
> just the www.example.com/items/1234.html style links while still
> crawling the browse_by_xyz.html style links.
>
> From my searching I thought that I could use crawl-urlfilter.txt to
> restrict where Nutch crawled, and regex-urlfilter.txt to restrict what
> was actually indexed. This did not seem to work, so I was either
> misinformed or implemented it incorrectly.
>
> Does Nutch have the capability I am looking for?
>
