Hi Shri,
It would be great if you could summarize
your tips and tricks at the end of your testing.

Please upload a copy to the new wiki at
http://wiki.apache.org/nutch/

Thanks
Olaf



On Tue, 22 Feb 2005 09:41:17 +0800, Admin @ LocalSearch.HK
<[EMAIL PROTECTED]> wrote:
> Hi Olaf / Everyone else,
> 
> I've solved the problem -- it was caused by my having changed the wrong
> urlfilter file. I had also assumed that the rules in the urlfilter were
> inclusive irrespective of their order, i.e. that
> 
> +abc
> -.
> and
> -.
> +abc
> 
> would result in the same crawl. (Silly mistake on my part; I was not
> thinking clearly at that point.)
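
The order dependence described above can be sketched in a few lines. This is a minimal Python model, not Nutch's actual Java code. It assumes that rules are tried from top to bottom, that the first pattern found in the URL decides whether the URL is included (+) or skipped (-), and that a URL matching no rule is skipped. The function names and the example URL are made up for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in url."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False  # assumption: a URL that matches no rule is skipped


first = parse_rules("+abc\n-.")
second = parse_rules("-.\n+abc")
url = "http://abc.example/"  # hypothetical URL, for illustration only

print(accepts(first, url))   # True: '+abc' is reached before '-.'
print(accepts(second, url))  # False: '-.' matches every URL first
```

Under these assumptions the two rule files are not equivalent: with "-." on the first line, no later "+" rule is ever consulted.
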
> 
> I am now busy doing the first set of indexing rounds to tweak what we need
> to include and exclude from our database.
> 
> I should have a blog with a day-by-day report going at
> http://www.localsearch.hk/blog by the end of the week. Hopefully it will
> serve as a good starting point for newbies like me who are not exactly Java
> programmers.
> 
> Having done SEO work for my sites, I now have a pretty good perspective of
> what the major engines go through and the brilliant job you folks have done.
> 
> Shri
> ----- Original Message -----
> From: "Olaf Thiele" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, February 22, 2005 3:54 AM
> Subject: Re: [Nutch-general] Crawling a specific set of domains -- how to?
> 
> > Hi Shri,
> > What exactly is your problem? Does the crawler not restrict itself
> > to the specified domain? Or is the site not being crawled at all?
> >
> > Cheers
> > Olaf
> >
> >
> >
> > On Mon, 21 Feb 2005 14:12:01 +0800, Shri @ GeoExpat.Com
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi there,
> >>
> >> (This is my first question to the list -- after a couple of weeks of
> >> browsing.)
> >>
> >> First the question:
> >> I'm trying to restrict the crawler to a set of domains. For example, we'd
> >> like to restrict it to .gov.hk domains for a site that allows searching
> >> of Hong Kong govt sites.
> >>
> >> I have the following setup.
> >>
> >> crawl-urlfilter.txt
> >> # skip file:, ftp:, & mailto: urls
> >> -^(file|ftp|mailto|https):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >>
> >> # accept anything else
> >> +^http://([a-z0-9]*\.)*.gov.hk
> >>
> >> Next I have the url http://www.info.gov.hk being injected from a urllist.
> >>
> >> Any ideas on what I'm doing wrong?
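
One thing worth checking in the setup above is the accept rule itself. The sketch below tests the posted pattern against the injected seed URL using Python's re module. It assumes Java's regex engine treats these basic constructs the same way and that the filter searches for the pattern from the start of the URL, as the ^ anchor suggests. The "escaped" pattern and the second URL are hypothetical, added only for comparison.

```python
import re

# Pattern exactly as posted in crawl-urlfilter.txt (without the leading '+').
posted = re.compile(r"^http://([a-z0-9]*\.)*.gov.hk")

# Hypothetical variant: no extra '.' before 'gov', and the literal dot escaped.
escaped = re.compile(r"^http://([a-z0-9]*\.)*gov\.hk")

seed = "http://www.info.gov.hk"  # seed URL from the post
odd = "http://www.xgov.hk"       # hypothetical URL, for illustration only

print(bool(posted.search(seed)))   # False
print(bool(escaped.search(seed)))  # True
print(bool(posted.search(odd)))    # True
print(bool(escaped.search(odd)))   # False
```

In the posted pattern the group already consumes the dot after each host label, so the extra unescaped "." demands one more character before "gov". In this sketch that is why the seed is not matched while a host such as www.xgov.hk is.
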
> >>
> >> Second:
> >>
> >> I must compliment the developers. Great job, and I look forward to being
> >> a contributor (please be gentle.. I am not a Java programmer.. but I can
> >> tweak the hell out of PHP).
> >>
> >> Regards,
> >> Shri
> >>
> >> ------------------------------------------------
> >> GeoClicks
> >> Unit 709, Cyberport 1,
> >> 100 Cyberport Road,
> >> Pokfulam, Hong Kong
> >> Phone: 2989-9145
> >> Fax: 2989-9143
> >
> >
> > --
> >
> > <SimpleHuman gender="male">
> >   <Physical name="Olaf Thiele" />
> >   <Virtual adress="http://www.olafthiele.de" />
> > </SimpleHuman>
> >
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-general
> >
> 


-- 

<SimpleHuman gender="male">
   <Physical name="Olaf Thiele" />
   <Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>

