Hi. I'm a relative novice with Nutch, and I have a custom architecture that I am finding difficult to support.

I would like someone to explain some of the basics of Nutch operation to me so that I can come up with a better solution than the one I have.

I am using Nutch to crawl a specific set of 500,000 named sites.
Each site has a set of tags that have to be included as fields when its pages are indexed by Lucene.

When I seed the crawl tool with all the URLs, it takes forever to run and then forever to index.

I would like some help creating a stable, continuously running system that I can tweak by occasionally adding or removing URLs. I also need the index-and-use cycle to repeat every 24 hours. Initially the content of the crawled database will be somewhat sparse, but over time it will fill up with successive depths of the 500,000 seed sites.

Please ask me any questions you need answered to clarify this situation; I'm not sure right now what information is relevant to your understanding.


Thanks in advance.
David.





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
