When run on inexpensive commodity hardware (a dual-core desktop-class CPU,
8 GB of RAM, and a fast hard drive), Nutch can usually handle around 20
million pages per search node. By simple math, that means it would take 50
such nodes to serve a 1-billion-page index.
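That node count is just ceiling division. As a quick sketch (the function name is illustrative, and the 20-million-pages-per-node figure is simply the estimate from the paragraph above, nothing Nutch-specific):

```python
# Back-of-the-envelope sizing: how many search nodes does an index need?
# Assumes ~20 million pages per commodity node, as estimated above.

def nodes_needed(total_pages, pages_per_node):
    """Round up: a partially filled node still costs a whole machine."""
    return -(-total_pages // pages_per_node)  # ceiling division

print(nodes_needed(1_000_000_000, 20_000_000))  # 1-billion-page index -> 50
```

The negative-floor-division trick is just an idiomatic way to get a ceiling without importing `math`.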

A setup like this would likely sustain 3-6 searches per second at
"Google-comparable" speeds. If you're expecting a high-volume site (to pay
for 50 nodes plus bandwidth, you would have to be), I would put a solid
caching layer (Squid or Varnish) in front of your search nodes to speed up
common queries, and perhaps add a fast SSD (32 GB) in each node to hold the
index alone. The near-zero access time and blazing-fast read speed of the
SSD (its write speed might be slower than our fast hard drive's, but that
doesn't matter in this case) should give you considerably more searches per
second per node.

Setups have been discussed where you place your entire index in RAM, which
provides the same speed benefits described above. That approach is cheaper
if your index is only 2-4 million pages, but it is not realistic if you're
looking at indexes in the tens of millions or billions: at roughly 3 million
pages per node held in RAM, you would need ~333 search nodes to serve a
1-billion-page index this way.
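The same ceiling-division sketch shows where the ~333 figure comes from (the 3-million-pages-per-node assumption is my reading of the "2-4 million pages" range above, not a measured number):

```python
# All-in-RAM sizing for a 1-billion-page index.
# Assumes ~3 million pages fit in one 8 GB node's RAM -- a guess
# consistent with the "2-4 million pages" range discussed above.

def nodes_needed(total_pages, pages_per_node):
    return -(-total_pages // pages_per_node)  # ceiling division

print(nodes_needed(1_000_000_000, 3_000_000))  # -> 334, i.e. the ~333 above
```

At that scale the per-node RAM saving is dwarfed by needing nearly seven times as many machines.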

The next question to ask yourself is: how will I fetch all the data? My
best answer, keeping the cheapest setup in mind, is to "double-duty" your
search nodes so that they also serve as data nodes, etc. during the fetch
and indexing cycle. This is where Hadoop comes into play, and you will most
likely want to research it a bit as well.
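For reference, one pass of the fetch/indexing cycle that would run across those double-duty nodes looks roughly like the following. This is a from-memory sketch of the classic Nutch 0.9/1.x command sequence; the `urls/` and `crawl/` paths are illustrative, so check the Nutch wiki for your version before relying on it:

```shell
# One Nutch crawl cycle (sketch; paths are examples, not mandated names)
bin/nutch inject crawl/crawldb urls                  # seed the crawl db
bin/nutch generate crawl/crawldb crawl/segments      # pick URLs to fetch
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment                             # the bandwidth-heavy step
bin/nutch updatedb crawl/crawldb $segment            # fold results back in
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```

Because these steps are MapReduce jobs, running Hadoop's DataNode/TaskTracker daemons on the search machines lets the same hardware do the fetching and indexing work between search loads.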



________________________________
Thank you very much!
But in this scenario, how much bandwidth would be required, both for Nutch
to serve incoming search traffic from users and for Nutch to fetch pages
from the internet?
-- 
View this message in context: 
http://www.nabble.com/the-question-of-the-nutch%27s-ability%21-tp21171116p21172476.html
Sent from the Nutch - User mailing list archive at Nabble.com.
