At the moment we are using a Nutch nightly build (nutch-2005-07-20). We are
not satisfied with the throughput of fetching, parsing, indexing, analyzing,
and scoring. Our spider currently retrieves approximately 25,000 new
results per day. All processes run on a single machine, and we are using
the local file system. We assume that to increase throughput we need to
move to a cluster.

 

1)      Are there any ready-made intermediate solutions (for example, for
storage) for running Nutch on a cluster?

2)      Has anyone had experience running Nutch on a cluster? If so, what
throughput was achieved, and how many machines were used?

3)      Which tasks can be distributed across different machines, and which
cannot? How must those tasks be synchronized?

4)      Will the spider run faster if we use NutchDistributedFileSystem?
What are the advantages and disadvantages of using
NutchDistributedFileSystem?

5)      We were advised to use the Nutch mapred branch. Should we use it?

 
