At the moment we are using a Nutch nightly build (nutch-2005-07-20). We are not
satisfied with the performance of fetching, parsing, indexing, analyzing, and
scoring. Our spider currently retrieves approximately 25,000 new results per
day. All processes run on a single machine, and we use the local file system.
We assume that to improve performance we need to use a cluster.

1) Are there any ready-made intermediate solutions (for example, for storage)
for clustering Nutch?

2) Has anyone had experience clustering Nutch? If so, what performance was
achieved, and how many computers were used?

3) Which tasks can be distributed across different computers, and which
cannot? How must those tasks be synchronized?

4) Will the spider run faster if we use NutchDistributedFileSystem? What are
the advantages and disadvantages of using NutchDistributedFileSystem?

5) We were advised to use the Nutch mapred branch. Should we use it?
