Sean Dean wrote:
When used with inexpensive commodity hardware (a dual-core desktop-class CPU, 8GB
of RAM and a fast hard drive), Nutch can usually handle around 20 million pages per
search node. By simple math, that means it would take 50 nodes to serve
a 1 billion page index.
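That back-of-envelope math can be sketched like this (the 20M pages/node figure is the rough capacity quoted above, not a measured benchmark):

```python
import math

# Assumptions from the discussion above, not measured benchmarks:
PAGES_PER_NODE = 20_000_000   # rough capacity of one commodity search node
TARGET_PAGES = 1_000_000_000  # desired index size

# Round up: a partial node still needs a whole machine.
nodes_needed = math.ceil(TARGET_PAGES / PAGES_PER_NODE)
print(nodes_needed)  # 50
```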
This type of setup would likely be able to produce 3-6 searches per second at
"Google comparable" speeds. If you're expecting a high volume site (to pay for 50
nodes plus bandwidth, you would have to), I would put a stable caching layer (Squid or
Varnish) in front of your search nodes to speed up common queries, and maybe add a fast
SSD (32GB) in each node to hold the index only. The near-zero access time and blazing
fast read speed of the SSD (write speed might be slower than our fast hard drive, but we
don't care in this case) should give you considerably more searches per second on
the nodes.
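The payoff of a cache in front of the search nodes is that repeated popular queries never reach the backend at all. A minimal in-process sketch of that idea (Squid/Varnish do the same thing over HTTP; `search_backend` here is a hypothetical stand-in for a query to a Nutch search node):

```python
from functools import lru_cache

def search_backend(query: str) -> tuple:
    # Hypothetical stand-in for an expensive query to a Nutch search node.
    return (f"result for {query}",)

# Cache common queries so repeats are served from memory instead of
# hitting the search nodes -- the effect Squid or Varnish gives you.
@lru_cache(maxsize=10_000)
def cached_search(query: str) -> tuple:
    return search_backend(query)

cached_search("nutch scaling")  # miss: goes to the backend
cached_search("nutch scaling")  # hit: served from the cache
print(cached_search.cache_info().hits)  # 1
```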
There have been setups discussed where you place your entire index in RAM, which
provides the same speed benefits I described above. This type of setup is cheaper
if your index is only 2-4 million pages in size, but it is not realistic if you're
looking at creating indexes in the hundreds of millions or billions: you would need
~333 search nodes to complete your setup using this method.
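The ~333-node figure appears to assume roughly 3 million pages held in RAM per node (the middle of the "2-4 million" range above — my reading, not stated explicitly):

```python
# Assumed from the text: an all-in-RAM node holds ~3M pages
# (the middle of the "2-4 million" range quoted above).
PAGES_PER_RAM_NODE = 3_000_000
TARGET_PAGES = 1_000_000_000

print(round(TARGET_PAGES / PAGES_PER_RAM_NODE))  # 333
```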
Yes and no. It is much more expensive and it does take more machines
for in-memory search than for on-disk search, but it can be made to scale
into the thousands of queries per second, which is where Google and Yahoo
are. With 16GB of RAM you can fit an 8-10M page index into memory, so
assuming 10M pages per node, it would take ~100 machines to serve a billion page index.
You would probably have a 4-5 billion page main index and a 10-15
billion page supplemental index. Assuming a 5:1 ratio for the supplemental
index, you would have 600-700 nodes in a search serving cluster for a
Google-size index of around 20 billion pages at their throughput of 3000
queries per second. And yes, there would be a significant caching and
pre-caching layer in front of it, using something like Varnish or Squid.
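One way Dennis's 600-700 node range works out (reading the "5:1 ratio" as the supplemental index being packed five times denser per node — that interpretation is my assumption; the page counts are the estimates from the message above):

```python
import math

# Estimates from the message above, not benchmarks:
PAGES_PER_NODE = 10_000_000    # ~8-10M pages fit in a 16GB node; take 10M
MAIN_PAGES = 4_000_000_000     # low end of the 4-5B page main index
SUPP_PAGES = 15_000_000_000    # high end of the 10-15B page supplemental index
SUPP_DENSITY = 5               # assumed meaning of the 5:1 supplemental ratio

main_nodes = math.ceil(MAIN_PAGES / PAGES_PER_NODE)
supp_nodes = math.ceil(SUPP_PAGES / (PAGES_PER_NODE * SUPP_DENSITY))
print(main_nodes + supp_nodes)  # 700, inside the quoted 600-700 node range
```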
Dennis
The next question you should ask yourself is: how will I fetch all the data? My best
answer, keeping the cheapest setup in mind, is to "double duty" your search
nodes so that they also become data nodes, etc. during the fetch and indexing cycle. This is
where Hadoop comes into play, and you will most likely want to research that a bit as well.
________________________________
From: Laurent Laborde <[email protected]>
To: [email protected]
Sent: Thursday, December 25, 2008 10:26:24 PM
Subject: Re: the question of the nutch's ability!
On Fri, Dec 26, 2008 at 2:35 AM, buddha1021 <[email protected]> wrote:
hi all:
I am very interested in Nutch! I want to ask some questions about
it:
(1) Can Nutch search 1 billion (= 1000 million) pages, where the page data
would reach 10TB (= 10000GB) in size? One page's size == 10KB.
Nutch uses Hadoop, which relies on HDFS to store its data,
and HDFS can certainly handle 10TB.
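The size estimate in the question checks out:

```python
# 1 billion pages at ~10KB each, using decimal units (1 TB = 10^9 KB):
pages = 1_000_000_000
page_size_kb = 10
total_tb = pages * page_size_kb / 1_000_000_000
print(total_tb)  # 10.0 (TB)
```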
(2) If Nutch can do this, what about the speed of search, compared with
Google? Can the search speed meet people's requirements?
I don't know. You should take a look at Lucene performance:
http://www.google.fr/search?q=lucene+performance
(3) If Nutch can do this, how many nodes would be required?
I'd like to know too :)