Hello,

All you need to do is change that line to:

+^http://([a-z0-9]*\.)*myCompany.com/myServlet\?

That's what the filter will do: it accepts only URLs on myCompany.com or any of its subdomains whose path starts with /myServlet?. The ? is escaped because an unescaped ? is a regex quantifier (it would make the preceding "t" optional instead of matching a literal question mark).

In terms of filtering, there are other options that you can play with in nutch-default.xml. Crawl with the default settings first, and if you get too many (or too few) results, start looking at the nutch-default.xml file.

Cheers

cemsoft wrote:
>
>
> hi
>
> how or where can i define the urls while crawling
> i want to index only the sites which has a certain link format eg.
>
> http://www.myCompany.com/myServlet?
> (while crawling i have now all the links under my company host but i need
> more filtering)
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*myCompany.com/
>
> index all pages whose link starts with
> "http://www.myCompany.com/myServlet?".....
>
> thnx for any idea
>
> regards
> cem
>

--
View this message in context: http://www.nabble.com/fetch-pattern-tp22101517p22163422.html
Sent from the Nutch - User mailing list archive at Nabble.com.
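If you want to sanity-check a filter line like the one in the reply before starting a crawl, a small script along these lines may help. It is only a rough sketch, not Nutch code: it assumes the filter tries the +/- rules in order, lets the first matching rule decide, and rejects URLs that match no rule. The function names (compile_rules, accepts) and the sample URLs are made up for illustration. It also compares the pattern with the ? escaped and unescaped.

```python
import re

# Hypothetical rule lists modelled on the filter line discussed above.
# A leading "+" means accept, a leading "-" means reject.
ESCAPED_RULES = [r"+^http://([a-z0-9]*\.)*myCompany.com/myServlet\?"]
UNESCAPED_RULES = [r"+^http://([a-z0-9]*\.)*myCompany.com/myServlet?"]


def compile_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_regex) pairs."""
    compiled = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        compiled.append((sign == "+", re.compile(pattern)))
    return compiled


def accepts(url, rules):
    """Assumed semantics: first matching rule decides, no match rejects."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


if __name__ == "__main__":
    samples = [
        "http://www.myCompany.com/myServlet?id=1",
        "http://www.myCompany.com/index.html",
        "http://shop.myCompany.com/myServletOther",
    ]
    escaped = compile_rules(ESCAPED_RULES)
    unescaped = compile_rules(UNESCAPED_RULES)
    for url in samples:
        print(url, accepts(url, escaped), accepts(url, unescaped))
```

With the escaped pattern only the first sample URL should be accepted; the unescaped pattern should also let /myServletOther through, because "t?" matches zero or one "t" rather than a literal question mark.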
