Re: Distributed nutch

Paul Baclace Wed, 09 Nov 2005 15:48:47 -0800

In addition to Stefan Groschupf's detailed references, here are some short, 
high-level answers to your questions:

Rozina Sorathia wrote:
>  1. What is Distributed nutch

 Nutch is a distributed version of Lucene search combined with large-scale 
web crawling.

>2. How nutch distributed works?

 It is modeled after Google's Map-Reduce and Google FS, a single-master, 
multiple-slave design tuned for 100-1000 nodes.
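
 To give a feel for the map-reduce model, here is a toy, single-process 
sketch in Python. It is not Nutch code: map_reduce, count_words and 
sum_counts are names made up for this illustration, and a real system would 
run the map and reduce steps on many slaves under the master's control.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-process map-reduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: combine all values collected for each key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words(page_text):
    # Hypothetical map function: emit (word, 1) for every word on a page.
    for word in page_text.lower().split():
        yield word, 1

def sum_counts(word, counts):
    # Hypothetical reduce function: total the occurrences of one word.
    return sum(counts)

pages = ["nutch crawls the web", "lucene indexes the web"]
word_counts = map_reduce(pages, count_words, sum_counts)
```

 In the distributed case the grouping step is what moves data between 
machines; the map and reduce functions themselves stay this simple.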

>3. When we say distributed, what is distributed?

 The filesystem is distributed with multiple copies of files on separate 
machines.  Crawling, parsing, sorting, and indexing are also distributed.
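
 As a rough sketch of what "multiple copies on separate machines" means, the 
Python below places each file block on distinct nodes and checks which 
blocks are still readable after one node dies. Everything here is 
hypothetical: the function names, the node and block names, the round-robin 
placement, and the default of two copies are assumptions for illustration, 
not the actual Nutch filesystem policy.

```python
def place_replicas(block_ids, nodes, copies=2):
    """Assign each block to `copies` distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are distinct while
        # copies <= len(nodes), so no node holds the same block twice.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(copies)]
    return placement

def surviving_blocks(placement, dead_node):
    """Blocks that still have at least one copy on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node != dead_node for node in holders)
    }

nodes = ["node-a", "node-b", "node-c"]
placement = place_replicas(["blk-0", "blk-1", "blk-2", "blk-3"], nodes)
readable = surviving_blocks(placement, "node-a")
```

 With two copies per block, losing any single node leaves every block 
readable from its other holder.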

>4. When one server goes down, what happens?

 If the master goes down, it can be restarted from a checkpointed state file.
 If a slave goes down, there is redundancy so that operations continue, data is 
not lost, and work in progress dependent on the dead node is automatically 
restarted.
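
 The slave-failure case can be pictured with the small Python sketch below: 
the master keeps a table of which slave is running which task and hands a 
dead slave's tasks to the remaining ones. This is only an assumed, 
simplified model; reassign_tasks and the task and slave names are invented 
for the example and do not come from the Nutch source.

```python
def reassign_tasks(assignments, dead_node, live_nodes):
    """Return a new task table with the dead node's tasks moved elsewhere."""
    if not live_nodes:
        raise RuntimeError("no live slaves left to take over the work")
    updated = {}
    moved = 0
    for task, node in assignments.items():
        if node == dead_node:
            # Restart the task on a live slave, spreading the load round-robin.
            updated[task] = live_nodes[moved % len(live_nodes)]
            moved += 1
        else:
            # Tasks on healthy slaves are left where they are.
            updated[task] = node
    return updated

assignments = {"parse-0": "slave-1", "parse-1": "slave-2", "index-0": "slave-2"}
after_failure = reassign_tasks(assignments, "slave-2", ["slave-1", "slave-3"])
```

 Because the input data is replicated, a restarted task can read its input 
from another copy rather than from the dead machine.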

Note that only Nutch version 0.8, which is still under development in the 
"mapred" branch, is distributed; earlier versions are not.
