Hi Olaf / Everyone else,

I've solved the problem -- it came down to my having changed the wrong urlfilter file. I had also assumed that the rules in the urlfilter would be, err, inclusive irrespective of their order, i.e.

+abc
-.
and
-.
+abc

would result in the same crawl. They don't, since the order of the rules matters. (Silly mistake on my part; I was not thinking at that point.)
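
For anyone else tripping over this, here is a minimal sketch of why the two orderings differ, assuming the filter works on a first-match-wins basis (the first rule whose pattern matches decides, and a URL that matches nothing is rejected). This is an illustration in Python, not Nutch's actual code; the function names and the example.org URL are made up for the example.

```python
import re

def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """The first matching rule decides; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

url = "http://example.org/abc"  # hypothetical URL, for illustration only

include_first = parse_rules("+abc\n-.")
exclude_first = parse_rules("-.\n+abc")

print(accepts(include_first, url))  # True: '+abc' is reached first
print(accepts(exclude_first, url))  # False: '-.' matches everything first
```

With the exclude-everything rule on top, the `+abc` line is never reached, so nothing gets crawled.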

I am now busy doing the first set of indexing rounds to tweak what we need to include and exclude from our database.

I should have a blog and a day-by-day report going on http://www.localsearch.hk/blog by the end of the week. Hopefully it will serve as a good starting point for newbies like me who are not exactly Java programmers.

Having done SEO work for my sites, I now have a pretty good perspective on what the major engines go through, and on the brilliant job you folks have done.

Shri
----- Original Message -----
From: "Olaf Thiele" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, February 22, 2005 3:54 AM
Subject: Re: [Nutch-general] Crawling a specific set of domains -- how to?



Hi Shri,
what exactly is your problem? Does the crawler not restrict itself
to the specified domain? Or is the site not being crawled at all?

Cheers
Olaf



On Mon, 21 Feb 2005 14:12:01 +0800, Shri @ GeoExpat.Com
<[EMAIL PROTECTED]> wrote:

Hi there,

(This is my first question to the list -- after a couple of weeks of
browsing.)

First the question:
I'm trying to restrict the crawler to a set of domains. For example, we'd
like to restrict it to .gov.hk domains for a site that allows searching of
Hong Kong government sites.


I have the following setup.

crawl-urlfilter.txt
# skip file:, ftp:, mailto:, & https: urls
-^(file|ftp|mailto|https):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept only hosts under gov.hk
+^http://([a-z0-9]*\.)*.gov.hk

Next I have the url http://www.info.gov.hk being injected from a urllist.
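
One thing worth checking in a setup like this is whether the accept pattern matches the seed URL at all. The sketch below uses Python's `re` module as a stand-in for the regex engine the crawler uses (an assumption, though these simple constructs behave the same way in most engines). The `escaped` pattern is a hypothetical variant, not something from the original file.

```python
import re

url = "http://www.info.gov.hk"

# Accept pattern as posted in the filter file, without the leading '+'.
as_posted = re.compile(r"^http://([a-z0-9]*\.)*.gov.hk")

# Hypothetical variant: the extra '.' before 'gov' is dropped and the
# literal dot in 'gov.hk' is escaped.
escaped = re.compile(r"^http://([a-z0-9]*\.)*gov\.hk")

print(bool(as_posted.search(url)))  # False
print(bool(escaped.search(url)))    # True
```

The posted pattern fails here because the group `([a-z0-9]*\.)*` always ends on a dot, and the unescaped `.` that follows then needs one more character before `gov`. In `www.info.gov.hk` there is nothing left between `info.` and `gov` to supply it.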

Any ideas on what I'm doing wrong?

Second:

Must compliment the developers. Great job, and I look forward to being a
contributor (please be gentle.. I am not a Java programmer.. but I can tweak
the hell out of PHP).


Regards,
Shri

------------------------------------------------
GeoClicks
Unit 709, Cyberport 1,
100 Cyberport Road,
Pokfulam, Hong Kong
Phone: 2989-9145
Fax: 2989-9143


--

<SimpleHuman gender="male">
  <Physical name="Olaf Thiele" />
  <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general




