Hi there,
 
(This is my first question to the list -- after a couple of weeks of browsing.)
 
First the question:
I'm trying to restrict the crawler to a set of domains. For example, we'd like to restrict it to .gov.hk domains for a site that allows searching of Hong Kong government sites.
 
I have the following setup.
 
crawl-urlfilter.txt
# skip file:, ftp:, mailto:, & https: urls
-^(file|ftp|mailto|https):
 
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
 
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
 
# accept only hosts under gov.hk
+^http://([a-z0-9]*\.)*.gov.hk

Next, I have the URL http://www.info.gov.hk being injected from a urllist.
 
Any ideas on what I'm doing wrong?
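
One way to narrow this down is to test the rules outside the crawler. The following is only a minimal sketch, not Nutch code: it assumes that Python's `re` treats these simple patterns the same way Nutch's regex engine does, and that the first rule whose pattern is found in the URL decides the outcome, with unmatched URLs rejected. The `accepts` helper and the `VARIANT` rule set are hypothetical names introduced for illustration; the "probable queries" skip rule is left out for brevity.

```python
import re

# Rules copied from the crawl-urlfilter.txt in this message, in file order.
RULES = [
    ("-", r"^(file|ftp|mailto|https):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    ("+", r"^http://([a-z0-9]*\.)*.gov.hk"),
]

# Hypothetical variant (an assumption, not a confirmed fix): the stray "."
# before "gov" is dropped, the remaining dots are escaped, hyphens are
# allowed in host labels, and the host must end at "/" or end of string.
VARIANT = RULES[:2] + [("+", r"^http://([a-z0-9-]*\.)*gov\.hk(/|$)")]


def accepts(url, rules):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    for url in ("http://www.info.gov.hk",
                "http://www.info.gov.hk/",
                "http://www.example.com/"):
        print(url, accepts(url, RULES), accepts(url, VARIANT))
```

Under those assumptions, the posted accept pattern does not match the seed URL: the unescaped `.` in front of `gov` has to consume one character, but the preceding group has already taken the dot after `info`. The variant accepts the seed URL and still rejects hosts outside gov.hk.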
 
Second:
 
I must compliment the developers. Great job, and I look forward to being a contributor (please be gentle: I am not a Java programmer, but I can tweak the hell out of PHP).
 
Regards,
Shri
 
------------------------------------------------
GeoClicks
Unit 709, Cyberport 1,
100 Cyberport Road,
Pokfulam, Hong Kong
Phone: 2989-9145
Fax: 2989-9143
