On some of the sites I want to index with Nutch, only specific types
of pages should be searchable. I need a way to crawl these sites in
full, but index only the pages whose URLs match a certain regular
expression.

For example:

Crawling www.example.com/browse/ finds links of the form
www.example.com/items/1234.html and
www.example.com/items/browse_by_xyz.html . I need to index only the
www.example.com/items/1234.html style pages, while still following
the browse_by_xyz.html style links so the crawl can discover more
items.
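
To make the distinction concrete, here is a rough sketch of the rule I
have in mind, written in plain Python rather than as Nutch
configuration. The pattern and the function name should_index are my
own, made up for illustration; they are not part of Nutch.

```python
import re

# Assumed pattern: an item page is /items/<digits>.html on www.example.com.
# The scheme is optional because the URLs above are written without one.
ITEM_PAGE = re.compile(r"^(?:https?://)?www\.example\.com/items/\d+\.html$")


def should_index(url: str) -> bool:
    """Return True only for item pages; every other URL is crawl-only."""
    return ITEM_PAGE.match(url) is not None


if __name__ == "__main__":
    for url in (
        "www.example.com/browse/",
        "www.example.com/items/1234.html",
        "www.example.com/items/browse_by_xyz.html",
    ):
        action = "crawl and index" if should_index(url) else "crawl only"
        print(url, "->", action)
```

Every URL would still be fetched and parsed for links; only the ones
for which should_index is true would end up in the search index.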

From my searching I thought that I could use crawl-urlfilter.txt to
restrict where Nutch crawled, and regex-urlfilter.txt to restrict what
was actually indexed. This did not seem to work, so I was either
misinformed or implemented it incorrectly.

Does Nutch have the capability I am looking for?
