Sean Dean wrote:
When used with inexpensive commodity hardware (a dual-core desktop-class CPU, 8GB
of RAM and a fast hard drive), Nutch can usually handle around 20 million pages per
search node. By simple math, that means it would take 50 nodes to serve
a 1 billion page index.
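That back-of-envelope math can be sketched like this (the 20M pages/node figure is the rough capacity quoted above, not a measured benchmark):

```python
import math

# Assumptions from the discussion above, not measured benchmarks:
PAGES_PER_NODE = 20_000_000   # rough capacity of one commodity search node
TARGET_PAGES = 1_000_000_000  # desired index size

# Round up: a partial node still needs a whole machine.
nodes_needed = math.ceil(TARGET_PAGES / PAGES_PER_NODE)
print(nodes_needed)  # 50
```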
This type of setup would likely be able to produce 3-6 searches per second at
"Google comparable" speeds. If you're expecting a high volume site (to pay for 50
nodes plus bandwidth, you would have to), I would put a stable caching layer (Squid or
Varnish) in front of your search nodes to speed up common queries, and maybe add a fast
SSD (32GB) in each node to hold the index only. The near-zero access time and blazing
fast read speed of the SSD (write speed might be slower than our fast hard drive, but we
don't care in this case) should give you considerably more searches per second on
the nodes.
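The payoff of a cache in front of the search nodes is that repeated popular queries never reach the backend at all. A minimal in-process sketch of that idea (Squid/Varnish do the same thing over HTTP; `search_backend` here is a hypothetical stand-in for a query to a Nutch search node):

```python
from functools import lru_cache

def search_backend(query: str) -> tuple:
    # Hypothetical stand-in for an expensive query to a Nutch search node.
    return (f"result for {query}",)

# Cache common queries so repeats are served from memory instead of
# hitting the search nodes -- the effect Squid or Varnish gives you.
@lru_cache(maxsize=10_000)
def cached_search(query: str) -> tuple:
    return search_backend(query)

cached_search("nutch scaling")  # miss: goes to the backend
cached_search("nutch scaling")  # hit: served from the cache
print(cached_search.cache_info().hits)  # 1
```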
There have been setups discussed where you place your entire index in RAM, which
provides the same speed benefits I described above. This type of setup is cheaper
if your index is only 2-4 million pages in size, but it is not realistic if you're
looking at creating indexes in the hundreds of millions or billions: you would need
~333 search nodes to complete your setup using this method.
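The ~333-node figure appears to assume roughly 3 million pages held in RAM per node (the middle of the "2-4 million" range above — my reading, not stated explicitly):

```python
# Assumed from the text: an all-in-RAM node holds ~3M pages
# (the middle of the "2-4 million" range quoted above).
PAGES_PER_RAM_NODE = 3_000_000
TARGET_PAGES = 1_000_000_000

print(round(TARGET_PAGES / PAGES_PER_RAM_NODE))  # 333
```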
Yes and no. It is much more expensive and it does take more machines
for in-memory search than for on-disk search, but it can be made to scale
into the thousands of queries per second, which is where Google and Yahoo
are. With 16GB of RAM you can fit an 8-10M page index into memory, so
assuming 10M pages per node, it would take ~100 machines to serve a billion page index.
You would probably have a 4-5 billion page main index and a 10-15
billion page supplemental index. Assuming a 5:1 ratio for the supplemental
index, you would have 600-700 nodes in a search serving cluster for a
Google-size index of around 20 billion pages at their throughput of 3000
queries per second. And yes, there would be a significant caching and
pre-caching layer in front of it, using something like Varnish or Squid.
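One way Dennis's 600-700 node range works out (reading the "5:1 ratio" as the supplemental index being packed five times denser per node — that interpretation is my assumption; the page counts are the estimates from the message above):

```python
import math

# Estimates from the message above, not benchmarks:
PAGES_PER_NODE = 10_000_000    # ~8-10M pages fit in a 16GB node; take 10M
MAIN_PAGES = 4_000_000_000     # low end of the 4-5B page main index
SUPP_PAGES = 15_000_000_000    # high end of the 10-15B page supplemental index
SUPP_DENSITY = 5               # assumed meaning of the 5:1 supplemental ratio

main_nodes = math.ceil(MAIN_PAGES / PAGES_PER_NODE)
supp_nodes = math.ceil(SUPP_PAGES / (PAGES_PER_NODE * SUPP_DENSITY))
print(main_nodes + supp_nodes)  # 700, inside the quoted 600-700 node range
```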
Dennis
The next question you should ask yourself is: how will I fetch all the data? My best
answer, keeping the cheapest setup in mind, is to "double duty" your search
nodes so that they also become data nodes, etc. during the fetch and indexing cycle. This is
where Hadoop comes into play, and you will most likely want to research that a bit as well.
________________________________
From: Laurent Laborde <[email protected]>
To: [email protected]
Sent: Thursday, December 25, 2008 10:26:24 PM
Subject: Re: the question of the nutch's ability!
On Fri, Dec 26, 2008 at 2:35 AM, buddha1021 <[email protected]> wrote:
hi all:
I am very interested in Nutch! I want to ask some questions about
it:
(1) Can Nutch search 1 billion (= 1000 million) pages, where the page data
would reach 10TB (= 10000GB) in size? One page's size == 10KB.
Nutch uses Hadoop, which relies on HDFS to store its data,
and HDFS can certainly handle 10TB.
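The size estimate in the question checks out:

```python
# 1 billion pages at ~10KB each, using decimal units (1 TB = 10^9 KB):
pages = 1_000_000_000
page_size_kb = 10
total_tb = pages * page_size_kb / 1_000_000_000
print(total_tb)  # 10.0 (TB)
```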
(2) If Nutch can do this, what about the speed of search, compared with
Google? Can the search speed meet people's requirements?
I don't know. You should take a look at Lucene performance:
http://www.google.fr/search?q=lucene+performance
(3) If Nutch can do this, how many nodes would be required?
I'd like to know too :)