Welcome! Nutch is different from anything else I have seen before. It's great, but also difficult, so expect to spend some time.
The best way to learn is to practice until you understand what it does.

1. Front-End (search): a web site which wraps a Lucene-based index. If you are not familiar with Lucene you can buy yourself the book Lucene in Action, but it is not really necessary. You can also use Solr as a more sophisticated front end.

2. Back-End (crawling to indexing): crawling is done in a number of steps (read the wiki) and uses two critical databases, crawldb and linkdb, to maintain a graph of where the engine has gone. It will fetch, parse, index pages...

3. Cluster / Cloud computing: based on Hadoop, it uses the map/reduce parallel processing technique for the different steps. There is a Hadoop book you can buy.

Good luck and see you on the mailing list.

2009/12/11, mengel <[email protected]>:
> Hello,Dear:
> I am a freshman for Nutch. I want to learn nutch, but I can't find a
> document for design such as architecture. Can you give me some advice for
> how to learn Nutch.Thank you very much.
>
> Mengel
>
>

--
-MilleBii-
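
P.S. In case it helps to picture the back-end steps, here is a toy sketch in Python of a crawl cycle (generate, fetch, parse, update crawldb, invert links into linkdb). It is not Nutch code and uses none of Nutch's API: the fake web, the status strings and the crawl() function are all made up for illustration, and the real crawldb and linkdb store far more than this.

```python
# Toy model of a crawl cycle. NOT Nutch's API: every name here is hypothetical.
from collections import defaultdict

# A tiny fake "web": URL -> outgoing links (stands in for fetching and parsing).
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}


def crawl(seeds, rounds):
    """Run a few crawl rounds and return (crawldb, linkdb)."""
    crawldb = {url: "unfetched" for url in seeds}  # URL -> fetch status
    linkdb = defaultdict(set)                      # URL -> set of pages linking to it

    for _ in range(rounds):
        # generate: select the URLs that still need fetching
        fetch_list = sorted(u for u, status in crawldb.items() if status == "unfetched")

        for url in fetch_list:
            # fetch + parse: here just a dictionary lookup of the outlinks
            outlinks = FAKE_WEB.get(url, [])
            crawldb[url] = "fetched"

            for target in outlinks:
                # update crawldb: newly discovered URLs become candidates
                crawldb.setdefault(target, "unfetched")
                # invert links: record who points at the target
                linkdb[target].add(url)

    return crawldb, dict(linkdb)


if __name__ == "__main__":
    crawldb, linkdb = crawl(["a"], rounds=2)
    print(crawldb)
    print(linkdb)
```

Each round only fetches what earlier rounds discovered, which is why a crawl is run as a repeated cycle rather than in one pass. In the real thing each of these steps is a map/reduce job, as far as I understand it.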
