Hi,

This is actually very easy: just create an indexing plugin, analyse the URL format, and return null from the indexing plugin if you don't want to index that page.
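As a rough sketch of the URL check such a plugin could perform, here is the logic in plain Java with no Nutch dependencies. The class name `ItemUrlGate`, the methods `shouldIndex` and `filter`, and the exact regular expression are made up for illustration. In a real plugin this check would presumably sit inside your indexing filter's filter method, which receives the document and the page URL and returns null to skip the page.

```java
import java.util.regex.Pattern;

public class ItemUrlGate {
    // Hypothetical pattern: accepts only numeric item pages such as
    // www.example.com/items/1234.html, with or without an http(s):// prefix.
    private static final Pattern ITEM_PAGE =
        Pattern.compile("^(https?://)?www\\.example\\.com/items/\\d+\\.html$");

    // True when the URL looks like an item page that should be indexed.
    static boolean shouldIndex(String url) {
        return url != null && ITEM_PAGE.matcher(url).matches();
    }

    // Mirrors the "return null if you don't want to index it" idea:
    // hand back the document unchanged, or null to drop it.
    static <T> T filter(T doc, String url) {
        return shouldIndex(url) ? doc : null;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("www.example.com/items/1234.html"));          // true
        System.out.println(shouldIndex("www.example.com/items/browse_by_xyz.html")); // false
    }
}
```

The browse_by_xyz.html pages would still be crawled, since the URL filters are left permissive; they are only rejected at indexing time.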
Best regards,
Magnus

On Wed, Feb 24, 2010 at 6:09 PM, Steven Wichers <ste...@devnet.com> wrote:
> On some of the sites I want to index with Nutch, there are only
> specific types of pages I would like to be searchable. I need a way to
> be able to crawl these sites, but only index pages that match a
> certain regular expression.
>
> ex:
>
> www.example.com/browse/ finds links in the form of
> www.example.com/items/1234.html and
> www.example.com/items/browse_by_xyz.html . I need to be able to index
> just the www.example.com/items/1234.html style links while still
> crawling the browse_by_xyz.html style links.
>
> From my searching I thought that I could use crawl-urlfilter.txt to
> restrict where Nutch crawled, and regex-urlfilter.txt to restrict what
> was actually indexed. This did not seem to work, so I was either
> misinformed or implemented it incorrectly.
>
> Does Nutch have the capability I am looking for?
>