Hi Paul,
Leaving aside the hardware requirements for the crawl...
The main issue with what you need to achieve is the nature of your index. If you're using the results of a standard Nutch web crawl, then search times under 500ms shouldn't be a problem.
But you actually want something more in the range of, say, a 200ms average, as otherwise you can quickly run into the overlapping search problem: once a search doesn't complete before the next search starts, both searches take longer, which increases the odds that a third search arrives before the previous search(es) have completed. So performance can deteriorate quickly under a load that's only slightly higher than your target case.
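
To make that concrete, here's a minimal sketch (plain Java, nothing Nutch-specific - the class name and numbers are just assumptions for illustration) of a single-threaded searcher receiving a query every 500ms:

public class OverlapSketch {

    // Average response time (queue wait + search time) in ms for n queries
    // arriving every arrivalMs, served one at a time in serviceMs each.
    static double averageLatencyMs(double arrivalMs, double serviceMs, int n) {
        double busyUntil = 0;       // when the searcher frees up
        double totalLatency = 0;
        for (int i = 0; i < n; i++) {
            double arrival = i * arrivalMs;
            double start = Math.max(arrival, busyUntil);  // wait if a search is still running
            double finish = start + serviceMs;
            totalLatency += finish - arrival;
            busyUntil = finish;
        }
        return totalLatency / n;
    }

    public static void main(String[] args) {
        // 200ms searches never overlap, so latency stays at 200ms.
        System.out.println(averageLatencyMs(500, 200, 100));  // 200.0
        // 600ms searches always overlap, so each query waits on the last and
        // the average keeps growing with load (~5550ms over 100 queries).
        System.out.println(averageLatencyMs(500, 600, 100));  // ~5550.0
    }
}

Real search times vary rather than being fixed, which only makes the knee in that curve sharper, but the shape is the same.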
However, getting a 200ms average isn't hard either, as long as the hardware is reasonable and the index size isn't huge.
In our experience, using more, cheaper boxes is the way to go. For web crawl data, I would probably go with two 10M-page indexes per box, where the Lucene index goes on a smaller, faster drive and the page contents go on a bigger, slower drive. So you'd have two faster drives and two slower drives per box, and use a dual CPU with dual cores, plus 4GB of RAM so each JVM gets 1.5GB with some breathing room for the OS.
Which means you'd need about five of these servers for 100M
pages...unless you want replication for reliability, which means 10
servers.
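
For reference, the back-of-envelope math behind those counts (all constants are just the assumptions above - 10M pages per index, two indexes per box, 1.5GB heap per JVM in a 4GB box):

public class SizingSketch {
    public static void main(String[] args) {
        long totalPages = 100000000L;    // 100M pages
        long pagesPerIndex = 10000000L;  // 10M pages per Lucene index
        int indexesPerBox = 2;

        long indexes = (totalPages + pagesPerIndex - 1) / pagesPerIndex;  // 10 index shards
        long boxes = (indexes + indexesPerBox - 1) / indexesPerBox;       // 5 boxes

        System.out.println("Index shards:           " + indexes);    // 10
        System.out.println("Boxes (no replication): " + boxes);      // 5
        System.out.println("Boxes (one replica):    " + boxes * 2);  // 10

        // Two 1.5GB JVMs in a 4GB box leave about 1GB for the OS and disk cache.
        double leftForOsGb = 4.0 - indexesPerBox * 1.5;
        System.out.println("RAM left for OS:        " + leftForOsGb + " GB");
    }
}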
-- Ken
No, not familiar with that yet - can you send out any URLs?
My question is really whether it's better to try for one or two big boxes or a series of small boxes - I'm also looking for anyone who has 100 million pages in their index and a description of their hardware as a reference point...
Thanks!
Paul
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of VK
Sent: Wednesday, November 28, 2007 9:53 PM
To: [email protected]
Subject: Re: Hardware Planning
Have you considered EC2 + S3?
Also Rightscale has some interesting solutions, which I am currently
evaluating.
On Nov 28, 2007 9:38 PM, Paul Stewart <[EMAIL PROTECTED]> wrote:
Hi folks...
I have read the archives and am looking for input specific to my estimated requirements:
Want to index about 100 million public webpages. Space and bandwidth are not a problem - coming up with the right hardware and keeping the cost down is my goal.
I would estimate only 1-2 searches per second, at least during the first hardware phase.
With that in mind I'm trying to figure out whether to use a couple of
larger Dell servers or a bunch of small single CPU, 1 Gig RAM, 160 GB
hard drive type of machines....
Can anyone share what they are using for hardware for about 100 million webpages, and their search result times? Real-world experience is important to me, and being able to scale is important....
Thanks,
Paul
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"