Hi there,
(This is my first question to the list -- after a
couple of weeks of browsing.)
First the question:
I'm trying to restrict the crawler to a set of domains. For example, we'd
like to restrict it to .gov.hk domains for a site that allows searching of
Hong Kong government sites.
I have the following setup.
crawl-urlfilter.txt:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept anything else
+^http://([a-z0-9]*\.)*.gov.hk

Next I have the url http://www.info.gov.hk being injected from a urllist.
Any ideas on what I'm doing wrong?
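One way to check a setup like this outside Nutch is to replay the rules against the seed URL. The sketch below is Python rather than Nutch's own Java, and it rests on assumptions: that the filter reads rules top to bottom, that the first matching pattern decides, that matching is partial (search-style), and that a URL matching no rule is ignored. The names load_rules, accepts, COMMON, AS_POSTED and VARIANT are hypothetical, made up for illustration. The obfuscated [EMAIL PROTECTED] rule is left out because its original text cannot be recovered from the archive.

```python
import re

# Rules shared by both rule sets, copied from the posted crawl-urlfilter.txt.
# The obfuscated "[EMAIL PROTECTED]" rule is omitted on purpose.
COMMON = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto|https):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
"""

# Accept rule exactly as posted.
AS_POSTED = COMMON + r"+^http://([a-z0-9]*\.)*.gov.hk"

# Hypothetical variant: no extra "." before "gov", and the dot in "gov.hk" escaped.
VARIANT = COMMON + r"+^http://([a-z0-9]*\.)*gov\.hk"


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule wins; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


seed = "http://www.info.gov.hk"
print(accepts(load_rules(AS_POSTED), seed))  # False
print(accepts(load_rules(VARIANT), seed))    # True
```

Under those assumptions the accept pattern as posted does not match http://www.info.gov.hk. Every repetition of ([a-z0-9]*\.) ends on a dot, and the extra "." before "gov" then demands one more character that the hostname does not have. Dropping that dot is one possible adjustment, not necessarily the only one.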
Second:
I must compliment the developers. Great job, and I look forward to being a
contributor (please be gentle: I am not a Java programmer, but I can tweak
the hell out of PHP).
Regards,
Shri
------------------------------------------------
GeoClicks Unit 709, Cyberport 1, 100 Cyberport Road, Pokfulam, Hong Kong Phone: 2989-9145 Fax: 2989-9143
