When used with inexpensive commodity hardware (a dual-core desktop-class CPU, 8GB of RAM, and a fast hard drive), Nutch can usually handle around 20 million pages per search node. By simple math, that means it would take 50 nodes to serve a 1-billion-page index.
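To make the arithmetic explicit, here is a rough back-of-the-envelope sketch. The per-node figures are the estimates from this thread (20M pages per node on disk, 2-4M per node if the index lives in RAM), not measured benchmarks:

```python
# Back-of-the-envelope node counts for a 1-billion-page index.
# Per-node capacities are the rough estimates from this thread.

total_pages = 1_000_000_000      # target index size: 1 billion pages
pages_per_node_hdd = 20_000_000  # ~20M pages per search node on commodity hardware

# Ceiling division: round up so the last partial node is counted.
nodes_hdd = -(-total_pages // pages_per_node_hdd)
print(nodes_hdd)  # -> 50

# All-in-RAM variant: each node holds only what fits in memory,
# roughly 2-4 million pages per node; use the midpoint.
pages_per_node_ram = 3_000_000
nodes_ram = -(-total_pages // pages_per_node_ram)
print(nodes_ram)  # -> 334, the thread's "~333"
```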
This type of setup would likely produce 3-6 searches per second at "Google comparable" speeds. If you're expecting a high-volume site (to pay for 50 nodes plus bandwidth, you would have to), I would put a stable caching layer (Squid or Varnish) in front of your search nodes to speed up common queries, and maybe add a fast SSD (32GB) to each node for holding the index only. The near-zero access time and blazing-fast read speed of the SSD (write speed might be slower than our fast hard drive, but we don't care in this case) should give you considerably more searches per second on the nodes.

There have been setups discussed where you place your entire index in RAM, which provides the same speed benefits I described above. That type of setup is cheaper if your index is only 2-4 million pages in size, but it is not realistic if you're looking at creating indexes in the multi-millions or billions: you would need ~333 search nodes to complete your setup using this method.

The next question you should ask yourself is: how will I fetch all the data? My best answer, keeping the cheapest setup in mind, is to "double duty" your search nodes so that they also become data nodes, etc., during the fetch and indexing cycle. This is where Hadoop comes into play, and you will most likely want to research that a bit as well.

________________________________
From: Laurent Laborde <[email protected]>
To: [email protected]
Sent: Thursday, December 25, 2008 10:26:24 PM
Subject: Re: the question of the nutch's ability!

On Fri, Dec 26, 2008 at 2:35 AM, buddha1021 <[email protected]> wrote:
>
> hi all:
> I am very interested in the nutch! I want to ask some questions about
> nutch:
> (1)Can nutch search 1 billion(=1000 millions) pages that the size of the
> page's data will achive 10T(=10000G) bytes? one page's size ==10k .

Nutch uses Hadoop, which relies on HDFS to store its data, and HDFS can certainly handle 10TB.

> (2)If nutch can do this, what about the speed of the search, compared with
> google? Can the speed of the search meet the people's requirement?

I don't know. You should take a look at Lucene performance:
http://www.google.fr/search?q=lucene+performance

> (3)If nutch can do this, how many nodes would be required?

I'd like to know too :)

--
F4FQM
Kerunix Flan
Laurent Laborde
