Hi Shri, what exactly is the problem? Does the crawler fail to restrict itself to the specified domains, or is the site not being crawled at all?
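One quick way to narrow this down is to run the rules from the quoted crawl-urlfilter.txt against the seed URL outside Nutch. Below is a minimal Python sketch of that check, not Nutch's own code. It assumes the filter tries the rules in order, lets the first matching rule decide, and rejects a URL that no rule matches. The `accepts` helper and the `ESCAPED_ACCEPT` variant are hypothetical names introduced here for illustration, and the query-character rule is left out because the archive obscured it.

```python
import re

# Rules as quoted from crawl-urlfilter.txt. The query-character rule is
# omitted because it is obscured in the archived message.
SKIP_SCHEMES = r"^(file|ftp|mailto|https):"
SKIP_SUFFIXES = (r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|"
                 r"mpg|xls|gz|rpm|tgz|mov|MOV|exe)$")
QUOTED_ACCEPT = r"^http://([a-z0-9]*\.)*.gov.hk"

# Hypothetical variant for comparison: no leading dot before "gov",
# and the remaining dot escaped so it only matches a literal ".".
ESCAPED_ACCEPT = r"^http://([a-z0-9]*\.)*gov\.hk"


def accepts(url, accept_pattern):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    rules = [("-", SKIP_SCHEMES), ("-", SKIP_SUFFIXES), ("+", accept_pattern)]
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


seed = "http://www.info.gov.hk"
print(accepts(seed, QUOTED_ACCEPT))   # False
print(accepts(seed, ESCAPED_ACCEPT))  # True
```

Under these assumptions the seed URL is rejected by the accept rule as quoted. The subdomain group always ends on a literal dot, and the unescaped `.` in front of `gov` then demands one more character before `gov`. If the real filter behaves the same way, that alone could explain why nothing gets crawled.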
Cheers
Olaf

On Mon, 21 Feb 2005 14:12:01 +0800, Shri @ GeoExpat.Com <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> (This is my first question to the list -- after a couple of weeks of
> browsing.)
>
> First the question:
> I'm trying to restrict the crawler to a set of domains. For example, we'd
> like to restrict them to .gov.hk domains for a site that allows searching of
> Hong Kong govt sites.
>
> I have the following setup.
>
> crawl-urlfilter.txt
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto|https):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept anything else
> +^http://([a-z0-9]*\.)*.gov.hk
>
> Next I have the url http://www.info.gov.hk being injected from a urllist.
>
> Any ideas on what I'm doing wrong?
>
> Second:
>
> Must complement the developers. Great job and look forward to being a
> contributor (please be gentle.. I am not a java programmer.. but I can tweak
> the hell out of php).
>
> Regards,
> Shri
>
> ------------------------------------------------
> GeoClicks
> Unit 709, Cyberport 1,
> 100 Cyberport Road,
> Pokfulam, Hong Kong
> Phone: 2989-9145
> Fax: 2989-9143

--
<SimpleHuman gender="male">
  <Physical name="Olaf Thiele" />
  <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>
