Hi Shri,
What exactly is the problem? Does the crawler fail to restrict itself
to the specified domains, or is the site not being crawled at all?

Cheers
Olaf



On Mon, 21 Feb 2005 14:12:01 +0800, Shri @ GeoExpat.Com
<[EMAIL PROTECTED]> wrote:
>  
> Hi there, 
>   
> (This is my first question to the list -- after a couple of weeks of
> browsing.) 
>   
> First the question: 
> I'm trying to restrict the crawler to a set of domains. For example, we'd
> like to restrict it to .gov.hk domains for a site that allows searching of
> Hong Kong government sites. 
>   
> I have the following setup. 
>   
> crawl-urlfilter.txt 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto|https): 
>   
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>   
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>   
> # accept anything else
> +^http://([a-z0-9]*\.)*.gov.hk
>  
> Next I have the url http://www.info.gov.hk being injected from a urllist. 
>   
> Any ideas on what I'm doing wrong? 
>   
> Second: 
>   
> Must compliment the developers. Great job, and I look forward to being a
> contributor (please be gentle: I am not a Java programmer, but I can tweak
> the hell out of PHP). 
>   
> Regards, 
> Shri 
>   
> ------------------------------------------------
> GeoClicks 
> Unit 709, Cyberport 1,
> 100 Cyberport Road,
> Pokfulam, Hong Kong
> Phone: 2989-9145
> Fax: 2989-9143 
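
One thing that may be worth checking in the quoted filter: the accept rule has an unescaped "." directly in front of "gov", so it demands one extra character before "gov" that http://www.info.gov.hk does not have. As far as I recall, a URL that matches no rule in crawl-urlfilter.txt is ignored, which would explain an empty crawl. The sketch below only illustrates the regex behaviour. It uses Python's re module as a stand-in for the regex engine Nutch uses (an assumption), and TIGHTENED is a hypothetical replacement pattern, not something taken from the Nutch distribution.

```python
import re

# Accept rule exactly as posted (without the leading "+").
# The unescaped "." before "gov" consumes one extra character.
POSTED = r"^http://([a-z0-9]*\.)*.gov.hk"

# Hypothetical tightened rule: stray "." removed, literal dots escaped,
# hyphens allowed in host labels, and the host must end at "gov.hk".
TIGHTENED = r"^http://([a-z0-9-]+\.)*gov\.hk(/|$)"


def accepts(pattern: str, url: str) -> bool:
    """Return True if the pattern matches somewhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == "__main__":
    for url in ("http://www.info.gov.hk",
                "http://www.info.gov.hk/index.htm",
                "http://www.gov.hk.example.com/"):
        print(url, accepts(POSTED, url), accepts(TIGHTENED, url))
```

If the tightened form behaves the same way in Nutch, the last rule in the filter would read "+^http://([a-z0-9-]+\.)*gov\.hk(/|$)".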


-- 

<SimpleHuman gender="male">
   <Physical name="Olaf Thiele" />
   <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general