[Nutch-general] Crawling a large, finite set of sites.

Terry Pothecary Tue, 11 Apr 2006 11:45:18 -0700

Hi. I'm a relative novice with Nutch. I have a custom architecture that I am finding difficult to support.

I would like someone to explain to me some of the basics of Nutch operation so that I can come up with a better solution than the one I have.


I am using Nutch to crawl a specific set of 500,000 named sites.

Each site has a set of tags that have to be included as fields when its pages are indexed by Lucene.
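
To make the tagging requirement concrete, here is a rough sketch of the lookup I have in mind. It is plain Python rather than Nutch or Lucene code; the site names, the tags, and the function names (tags_for_url, index_fields) are all made up for illustration, and it assumes a page belongs to a seed site when its host matches exactly (ignoring a leading "www.").

```python
from urllib.parse import urlparse

# Hypothetical per-site tags; the real table would cover all 500,000 sites.
SITE_TAGS = {
    "example.com": ["news", "uk"],
    "example.org": ["charity"],
}


def tags_for_url(url, site_tags=SITE_TAGS):
    """Return the tags of the seed site that a crawled page belongs to."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: www.example.com and example.com count as the same site.
    if host.startswith("www."):
        host = host[4:]
    # Pages from hosts outside the seed list get no tags.
    return site_tags.get(host, [])


def index_fields(url, content):
    """Build the fields that would accompany a page when it is indexed."""
    return {"url": url, "content": content, "tags": tags_for_url(url)}
```

Other subdomains (for example blog.example.com) would not match in this sketch; whether they should inherit the parent site's tags is one of the things I have not decided.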

So when I seed the crawl tool with all the URLs, it takes forever to run and then forever to index.

I would like some help creating a stable, continuously running system that I can tweak by occasionally adding or removing URLs. I also need the index-and-use cycle to repeat every 24 hours. Initially the content of the crawled database will be somewhat sparse, but over time it will fill up with successive depths of the 500,000 seed sites.

Please ask me any further questions you need to clarify this situation; I'm not sure right now what information is relevant to your understanding.



Thanks in advance.
David.





