Nutch has a file called crawl-urlfilter.txt where you can set your site domain or a list of sites, so Nutch will crawl only those. The best way to understand it is to download Nutch and see it working :). Take a look: http://lucene.apache.org/nutch/tutorial8.html
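As a rough sketch, entries in crawl-urlfilter.txt are regular expressions prefixed with + (accept) or - (reject), checked top to bottom; a minimal filter restricting the crawl to a single domain might look like the following (example.com is a placeholder for your own site):

```
# Hypothetical crawl-urlfilter.txt fragment -- replace example.com with your domain.
# Accept URLs on the target domain (including subdomains):
+^http://([a-z0-9]*\.)*example.com/
# Reject everything else:
-.
```

To crawl a list of sites instead of one, you would add one +^http://... line per domain before the final catch-all reject.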
Regards,

On 4/5/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
Thanks. Can you please tell me how I can plug in my own handling when Nutch sees a site, instead of building the search database for that site?

On 4/3/07, Lourival Júnior <[EMAIL PROTECTED]> wrote:
> I am certain that Nutch is what you are looking for. Take a look at
> Nutch's documentation for more details and you will see :).
>
> On 4/3/07, Meryl Silverburgh <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> >
> > I would like to know whether it is a good idea to use the Nutch web
> > crawler. Basically, this is what I need:
> > 1. I have a list of web sites.
> > 2. I want the crawler to go through each site and parse the anchors.
> >    If an anchor is in the same domain, repeat the same steps, down to
> >    3 levels.
> > 3. For each link, write to a new file.
> >
> > Is Nutch a good solution, or is there a better open source
> > alternative for my purpose?
> >
> > Thank you.
>
> --
> Lourival Junior
> Universidade Federal do Pará
> Curso de Bacharelado em Sistemas de Informação
> http://www.ufpa.br/cbsi
> Msn: [EMAIL PROTECTED]
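For the three-level crawl described in the quoted message, the Nutch 0.8 tutorial's one-step crawl command takes a depth option; a hedged sketch (the directory names, seed file, and -topN value here are placeholders, not anything from this thread):

```
# Sketch only: assumes a Nutch 0.8 installation, with a seed file urls/
# containing the list of start sites, and crawl-urlfilter.txt configured.
# -depth 3 limits the crawl to 3 link levels from the seeds.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```

The per-URL "write each link to a new file" step is not something the stock crawl command does; that is where a custom plugin or a post-processing pass over the crawl output would come in.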
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]