When used with inexpensive commodity hardware (a dual-core desktop-class CPU, 8GB 
RAM, and a fast hard drive), Nutch can usually handle around 20 million pages per 
search node. By that simple math, it would take 50 nodes to serve a 1 billion 
page index.
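That back-of-envelope calculation can be sketched as follows (a hypothetical illustration, assuming the 20 million pages per node figure above):

```python
import math

# Assumed capacity of one commodity search node (from the estimate above).
PAGES_PER_NODE = 20_000_000
TARGET_INDEX_PAGES = 1_000_000_000  # 1 billion pages

# Number of search nodes needed to hold the whole index.
nodes = math.ceil(TARGET_INDEX_PAGES / PAGES_PER_NODE)
print(nodes)  # 50
```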

This type of setup would likely be able to produce 3-6 searches per second at 
"Google comparable" speeds. If you're expecting a high volume site (to pay for 50 
nodes plus bandwidth you would have to), I would employ a stable caching layer 
(Squid or Varnish) in front of your search nodes to speed up common queries, and 
maybe add a fast SSD (32GB) in each node for holding the index only. The 
near-zero access time and blazing fast read speed of the SSD (write speed might 
be slower than our fast hard drive, but we don't care in this case) should 
provide you with considerably more searches per second on the nodes.
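The effect of a cache in front of the search nodes can be put in rough numbers. This is a hypothetical sketch; the 60% hit rate is an assumption for illustration, not a measured figure:

```python
# Raw capacity of the cluster without caching, in searches/sec
# (worst case of the 3-6 searches/sec estimate above).
raw_qps = 3.0

# Assumed fraction of incoming queries answered directly by Squid/Varnish.
cache_hit_rate = 0.6  # hypothetical value

# Only cache misses reach the search nodes, so the cluster can absorb
# a proportionally higher incoming query rate.
sustainable_incoming_qps = raw_qps / (1 - cache_hit_rate)
print(sustainable_incoming_qps)  # 7.5
```

In other words, a cache that absorbs 60% of queries would let the same cluster sustain 2.5x the incoming query rate.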

There have been setups discussed where you place your entire index in RAM, 
which provides the same speed benefits I described above. This type of setup is 
cheaper if your index is only 2-4 million pages in size, but it is not realistic 
if you're looking at creating indexes in the multi-millions or billions; you 
would need ~333 search nodes to complete your setup using this method.
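The ~333 figure follows from the same division, assuming roughly 3 million pages fit in RAM per node (the midpoint of the 2-4 million range above):

```python
# Assumed pages that fit in RAM per node (midpoint of the 2-4 million range).
PAGES_PER_RAM_NODE = 3_000_000
TARGET_INDEX_PAGES = 1_000_000_000  # 1 billion pages

# Approximate number of all-RAM search nodes for a 1 billion page index.
nodes = round(TARGET_INDEX_PAGES / PAGES_PER_RAM_NODE)
print(nodes)  # 333
```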

The next question you should ask yourself is: how will I fetch all the data? My 
best answer, keeping the cheapest setup in mind, is to "double duty" your 
search nodes so that they also serve as data nodes, etc. during the fetch and 
indexing cycle. This is where Hadoop comes into play, and you will most likely 
want to research that a bit as well.



________________________________
From: Laurent Laborde <[email protected]>
To: [email protected]
Sent: Thursday, December 25, 2008 10:26:24 PM
Subject: Re: the question of the nutch's ability!

On Fri, Dec 26, 2008 at 2:35 AM, buddha1021 <[email protected]> wrote:
>
> hi all:
>  I am very interested in Nutch! I want to ask some questions about
> Nutch:
>  (1) Can Nutch search 1 billion (= 1000 million) pages, where the total
> page data will reach 10TB (= 10000GB)? One page's size == 10KB.

Nutch uses Hadoop, which relies on HDFS to store its data,
and HDFS can certainly handle 10TB.

>  (2) If Nutch can do this, what about the speed of the search compared with
> Google? Can the search speed meet people's requirements?

I don't know. You should take a look at Lucene performance.
http://www.google.fr/search?q=lucene+performance

>  (3) If Nutch can do this, how many nodes would be required?

I'd like to know too :)

-- 
F4FQM
Kerunix Flan
Laurent Laborde
