Re: fetch pattern

ahammad Mon, 23 Feb 2009 07:17:37 -0800

Hello,

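All you need to do is change that line to the following (the dots and the
trailing ? are escaped so that they are matched literally; an unescaped ?
would just make the final "t" optional):

+^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?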

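With that rule the filter accepts only URLs on myCompany.com, or on any of
its subdomains, whose path starts with /myServlet?. If your filter file still
contains the default rule that skips URLs with query characters (-[?*!@=]),
put the new line above it or remove that rule, because the first rule that
matches a URL decides.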
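
A minimal sketch of how such a rule behaves, written in Python rather than
Nutch's own Java code. It assumes the filter tries the rules top to bottom,
searches each pattern anywhere in the URL, and lets the first match decide.
The accepts() helper, the RULES list, and the catch-all "-." rule are
illustrative assumptions, not part of Nutch.

```python
import re

# Hypothetical rule list in the "+pattern / -pattern" style of the filter file.
# The dots and the trailing "?" are escaped so they are matched literally.
RULES = [
    r"+^http://([a-z0-9]*\.)*myCompany\.com/myServlet\?",
    r"-.",  # assumed catch-all: reject everything else
]

# The same pattern without escaping, kept only for comparison.
UNESCAPED = r"^http://([a-z0-9]*\.)*myCompany.com/myServlet?"


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all: reject


if __name__ == "__main__":
    for url in [
        "http://www.myCompany.com/myServlet?id=1",
        "http://myCompany.com/myServlet?x=1",
        "http://www.myCompany.com/index.html",
        "http://www.myCompany.com/myServlets/list",
    ]:
        print(url, accepts(url), bool(re.search(UNESCAPED, url)))
```

The last URL shows why the escaping matters: the unescaped pattern also
matches /myServlets/list, because there "?" only makes the "t" optional.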

In terms of filtering, there are other options you can experiment with; they
are listed in nutch-default.xml. Crawl with the default settings first, and if
you get too many (or too few) results, start looking through that file and
override the relevant properties in nutch-site.xml.

Cheers

cemsoft wrote:
> 
> 
> Hi,
> 
> How or where can I define which URLs are fetched while crawling?
> I want to index only the pages that have a certain link format, e.g.
> 
> http://www.myCompany.com/myServlet?
> (While crawling I currently get all the links under my company host, but I
> need more filtering.)
> 
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*myCompany.com/
> 
> I want to index all pages whose link starts with
> "http://www.myCompany.com/myServlet?"
> 
> Thanks for any ideas.
> 
> regards
> cem
> 

-- 
View this message in context: 
http://www.nabble.com/fetch-pattern-tp22101517p22163422.html
Sent from the Nutch - User mailing list archive at Nabble.com.
