Hi Shri, it would be great if you could summarize your tips and tricks at the end of your testing.
Please upload a copy to the new wiki at http://wiki.apache.org/nutch/

Thanks
Olaf

On Tue, 22 Feb 2005 09:41:17 +0800, Admin @ LocalSearch.HK <[EMAIL PROTECTED]> wrote:
> Hi Olaf / Everyone else,
>
> I've solved the problem -- which was related to having changed the wrong
> urlfilter file. I also thought that the rules in the urlfilter would be an
> err.. inclusive irrespective of the order i.e.
>
> +abc
> -.
> and
> -.
> +abc
>
> would result in the same crawl. (Silly mistake on my part.. was not thinking
> at that point).
>
> I am now busy doing the first set of indexing rounds to tweak what we need
> to include and exclude from our database.
>
> Should have a blog and a day by day report going on
> http://www.localsearch.hk/blog by the end of the week. Hopefully should
> serve as a good starting point for newbies like me who are not exactly java
> programmers.
>
> Having done SEO work for my sites, I now have a pretty good perspective of
> what the major engines go through and the brilliant job you folks have done.
>
> Shri
> ----- Original Message -----
> From: "Olaf Thiele" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Tuesday, February 22, 2005 3:54 AM
> Subject: Re: [Nutch-general] Crawling a specific set of domains -- how to?
>
> > Hi Shri,
> > what exactly is your problem? The crawler does not restrict itself
> > to the specified domain? It isn't being crawled at all?
> >
> > Cheers
> > Olaf
> >
> > On Mon, 21 Feb 2005 14:12:01 +0800, Shri @ GeoExpat.Com
> > <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi there,
> >>
> >> (This is my first question to the list -- after a couple of weeks of
> >> browsing.)
> >>
> >> First the question:
> >> I'm trying to restrict the crawler to a set of domains. For example, we'd
> >> like to restrict them to .gov.hk domains for a site that allows searching
> >> of Hong Kong govt sites.
> >>
> >> I have the following setup.
> >>
> >> crawl-urlfilter.txt
> >> # skip file:, ftp:, & mailto: urls
> >> -^(file|ftp|mailto|https):
> >>
> >> # skip image and other suffixes we can't yet parse
> >> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> [EMAIL PROTECTED]
> >>
> >> # accept anything else
> >> +^http://([a-z0-9]*\.)*.gov.hk
> >>
> >> Next I have the url http://www.info.gov.hk being injected from a urllist.
> >>
> >> Any ideas on what I'm doing wrong?
> >>
> >> Second:
> >>
> >> Must compliment the developers. Great job and look forward to being a
> >> contributor (please be gentle.. I am not a java programmer.. but I can
> >> tweak the hell out of php).
> >>
> >> Regards,
> >> Shri
> >>
> >> ------------------------------------------------
> >> GeoClicks
> >> Unit 709, Cyberport 1,
> >> 100 Cyberport Road,
> >> Pokfulam, Hong Kong
> >> Phone: 2989-9145
> >> Fax: 2989-9143
> >
> > --
> > <SimpleHuman gender="male">
> > <Physical name="Olaf Thiele" />
> > <Virtual adress="http://www.olafthiele.de" />
> > </SimpleHuman>
> >
> > -------------------------------------------------------
> > SF email is sponsored by - The IT Product Guide
> > Read honest & candid reviews on hundreds of IT Products from real users.
> > Discover which products truly live up to the hype. Start reading now.
> > http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> > _______________________________________________
> > Nutch-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/nutch-general

--
<SimpleHuman gender="male">
<Physical name="Olaf Thiele" />
<Virtual adress="http://www.olafthiele.de" />
</SimpleHuman>
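
A note on the rule-order point from the quoted message: here is a minimal sketch of why "+abc" followed by "-." and "-." followed by "+abc" give different crawls. It assumes the filter applies the first rule whose regex is found anywhere in the URL and rejects any URL that matches no rule. That is only my reading of how the regex URL filter behaves, not a copy of the Nutch code, and parse_rules and accepts are made-up names for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        # Store (include?, compiled regex) in file order.
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule that matches decides; no match means reject."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


include_first = parse_rules("+abc\n-.")
exclude_first = parse_rules("-.\n+abc")

print(accepts(include_first, "http://abc.example/"))  # True: '+abc' is reached first
print(accepts(include_first, "http://xyz.example/"))  # False: falls through to '-.'
print(accepts(exclude_first, "http://abc.example/"))  # False: '-.' matches every URL first
```

Under these assumptions a catch-all "-." only makes sense as the last rule, after all the "+" rules it is meant to complement. The example.* hosts are placeholders, not sites from the thread.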
