Hi Paul,

Leaving aside the hardware requirements for the crawl...

The main issue with what you're trying to achieve is the nature of your index. If you're searching the results of a standard Nutch web crawl, then search times < 500ms shouldn't be a problem.

But you actually want something more in the range of, say, 200ms average, as otherwise you can quickly run into the overlapping search problem...once a search doesn't complete before the next search starts, both searches take longer, which increases the odds that a third search arrives before the previous searches have completed. So performance can deteriorate quickly under a load that's only slightly higher than your target case.
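
To make that concrete, here's a toy single-searcher queue simulation (plain Java, not Nutch or Lucene code). The 500ms arrival gap (roughly 2 searches/sec) and the search times are made-up numbers for illustration only:

import java.util.Random;

// Toy single-searcher queue: once individual searches take about as long as
// the gap between arriving searches, they pile up and average latency climbs.
public class OverlapDemo {

    static double avgLatencyMs(double arrivalGapMs, double meanSearchMs, int nQueries) {
        Random rnd = new Random(42);
        double searcherFreeAt = 0.0;   // time the searcher finishes its current query
        double totalLatency = 0.0;
        for (int i = 0; i < nQueries; i++) {
            double arrival = i * arrivalGapMs;
            // search time jitters +/-50% around the mean
            double searchMs = meanSearchMs * (0.5 + rnd.nextDouble());
            double start = Math.max(arrival, searcherFreeAt);  // wait if a search is still running
            searcherFreeAt = start + searchMs;
            totalLatency += searcherFreeAt - arrival;          // queueing delay + search time
        }
        return totalLatency / nQueries;
    }

    public static void main(String[] args) {
        // Queries arrive every 500ms (2 searches/sec):
        System.out.printf("200ms searches -> avg latency %.0f ms%n", avgLatencyMs(500, 200, 2000));
        System.out.printf("450ms searches -> avg latency %.0f ms%n", avgLatencyMs(500, 450, 2000));
        System.out.printf("550ms searches -> avg latency %.0f ms%n", avgLatencyMs(500, 550, 2000));
    }
}

With 200ms searches the average stays right at the search time; at 450ms the jitter already causes some queueing; at 550ms (slower than the arrival gap) latency just keeps growing.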

However, getting a 200ms average isn't hard either, as long as the hardware is reasonable and the index size isn't huge.

In our experience, using more, cheaper boxes is the way to go. For web crawl data, I would probably go with two 10M-page indexes per box, where each Lucene index goes on a smaller, faster drive and the page contents go on a bigger, slower drive. So you'd have two faster drives and two slower drives per box, plus a dual-CPU box with dual-core processors. And 4GB of RAM, so each searcher JVM gets 1.5GB with some breathing room left for the OS.

Which means you'd need about five of these servers for 100M pages...unless you want replication for reliability, which means 10 servers.
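
For reference, here's the back-of-the-envelope arithmetic as a quick sketch; the per-box numbers (two 10M-page indexes, 4GB RAM, 1.5GB heap per searcher JVM) and the 2x replication factor are just the assumptions from above, not measured figures:

// Rough cluster sizing based on the layout described above.
public class ClusterSizing {
    public static void main(String[] args) {
        long totalPages    = 100_000_000L;
        long pagesPerIndex =  10_000_000L;
        int  indexesPerBox = 2;                           // one Lucene index per fast drive
        long pagesPerBox   = pagesPerIndex * indexesPerBox;

        long boxes = (totalPages + pagesPerBox - 1) / pagesPerBox;
        System.out.println("Boxes without replication: " + boxes);        // 5
        System.out.println("Boxes with replication:    " + boxes * 2);    // 10

        // RAM budget per box: two searcher JVMs at 1.5GB each, rest for the OS
        double ramGb = 4.0, heapPerJvmGb = 1.5;
        System.out.println("Left for OS/page cache:    "
                + (ramGb - indexesPerBox * heapPerJvmGb) + " GB");        // 1.0 GB
    }
}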

-- Ken

No, not familiar with that yet - can you send out any URLs?

My question is really whether you're better to try for one or two big
boxes or a series of small boxes - also looking for anyone who has 100
million pages in their index and a description of their hardware as a
reference point...

Thanks!

Paul


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of VK
Sent: Wednesday, November 28, 2007 9:53 PM
To: [email protected]
Subject: Re: Hardware Planning

Have you considered EC2 + S3?

Also Rightscale has some interesting solutions, which I am currently
evaluating.

On Nov 28, 2007 9:38 PM, Paul Stewart <[EMAIL PROTECTED]> wrote:

 Hi folks...

 I have read the archives and am looking for input specific to my
 estimated requirements:

 I want to index about 100 million public webpages.  Space and bandwidth
 are not a problem - coming up with the right hardware and keeping the
 cost down is my goal.

 I would estimate only 1-2 searches per second, at least during the
 first hardware phase.

 With that in mind I'm trying to figure out whether to use a couple of
 larger Dell servers or a bunch of small single CPU, 1 Gig RAM, 160 GB
 hard drive type of machines....

 Can anyone share what they are using for hardware for about 100 million
 webpages and their search result times etc??  Real-world is important
 to me and being able to scale is important....

 Thanks,

 Paul











--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
