[Nutch-general] Crawling a specific set of domains -- how to?

Shri @ GeoExpat.Com Sun, 20 Feb 2005 22:13:04 -0800

Hi there,

(This is my first question to the list -- after a couple of weeks of browsing.)

First the question:

I'm trying to restrict the crawler to a set of domains. For example, we'd like to restrict it to .gov.hk domains for a site that allows searching of Hong Kong government sites.

I have the following setup.

crawl-urlfilter.txt

# skip file:, ftp:, mailto:, & https: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept only hosts under gov.hk
+^http://([a-z0-9]*\.)*.gov.hk

Next, the URL http://www.info.gov.hk is injected from a urllist.

Any ideas on what I'm doing wrong?
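
One way to narrow this down is to replay the rules outside Nutch. The sketch below is only an approximation: it assumes the filter tries the rules from top to bottom, lets the first pattern found anywhere in the URL decide, and rejects a URL that matches no rule. The names RULES, FIXED_RULES and accepts are made up for illustration, the rule for query characters is left out because its pattern is not legible in the listing above, and the pattern in FIXED_RULES is a hypothetical alternative, not something taken from the Nutch documentation.

```python
import re

# Rules copied from the crawl-urlfilter.txt listing: (accept?, pattern).
# The query-character rule is omitted because its pattern is not legible.
RULES = [
    (False, r"^(file|ftp|mailto|https):"),
    (False, r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg"
            r"|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    (True, r"^http://([a-z0-9]*\.)*.gov.hk"),
]

# Hypothetical alternative for the last rule: no stray "." before "gov",
# and the dot in "gov.hk" escaped so it only matches a literal dot.
FIXED_RULES = RULES[:-1] + [(True, r"^http://([a-z0-9]*\.)*gov\.hk")]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is an accept rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False


if __name__ == "__main__":
    seed = "http://www.info.gov.hk"
    print("as posted:", accepts(seed))
    print("alternative:", accepts(seed, FIXED_RULES))
```

Under those assumptions the seed URL is rejected by the rules as posted. The sub-pattern ([a-z0-9]*\.)* already consumes the dot after "info", and the extra "." in front of "gov" then demands one more character that the URL does not have. With the alternative pattern the seed is accepted, while https URLs, image links and hosts outside gov.hk are still rejected.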

Second:

I must compliment the developers. Great job, and I look forward to being a contributor (please be gentle: I am not a Java programmer, but I can tweak the hell out of PHP).

Regards,

Shri

------------------------------------------------
GeoClicks
Unit 709, Cyberport 1,
100 Cyberport Road,
Pokfulam, Hong Kong
Phone: 2989-9145
Fax: 2989-9143
