Thanks very much for the details... I appreciate it...

I'd be happy with the 500ms range on *average* but totally understand
your point about searches "piling up"....

So you're suggesting about 20 million pages per box - each box with 4
drives, dual CPU and 4 gig RAM?

I guess what I don't totally understand is which servers need lots of RAM
and which ones need lots of storage.  I was thinking of
some low end boxes (2 gig RAM, 160 Gig HD, single low end processor) for
storage and a couple of heftier boxes (dual cpu, 4 Gig RAM, 500 GB hard
drives)  - is this way off track?

What needs RAM and what needs storage in the components of Nutch?

Thanks again,

Paul


-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 29, 2007 11:13 AM
To: [email protected]
Subject: RE: Hardware Planning

Hi Paul,

Leaving aside the hardware requirements for the crawl...

The main issue with what you need to achieve your target is the nature of
your index. If you're using the results of a standard Nutch web
crawl, then search times < 500ms shouldn't be a problem.

But you actually want something more in the range of say 200ms
average, as otherwise you can quickly run into the overlapping search
problem...once a search doesn't complete in time before another
search starts running, both searches take longer, which increases the
odds that the third search happens before the previous search(es)
have completed. So the performance can quickly deteriorate under a
load that's only slightly higher than your target case.
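
A rough way to see the pile-up effect is a textbook M/M/1 queue. This is
only a sketch under assumptions that are not from this thread: random
(Poisson) arrivals, exponentially distributed search times, and a single
searcher handling one query at a time. The function name is made up for
illustration, and real Nutch searchers will behave differently.

```python
def mm1_mean_response_time(arrival_rate, service_time):
    """Mean time in system (queueing + service) for an M/M/1 queue, in seconds.

    arrival_rate: average searches per second
    service_time: average time to run one search with no contention, in seconds
    Returns float("inf") when searches arrive at least as fast as they complete.
    """
    service_rate = 1.0 / service_time
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    # Compare a 500ms and a 200ms uncontended search time at 1-2 searches/sec.
    for service_time in (0.5, 0.2):
        for rate in (1.0, 1.5, 2.0):
            t = mm1_mean_response_time(rate, service_time)
            print(f"search={service_time * 1000:.0f}ms  load={rate}/s  "
                  f"mean response={t * 1000:.0f}ms")
```

Under this model a 500ms search already averages about one second of
response time at 1 search/sec and cannot keep up at all at 2/sec, while
a 200ms search stays at roughly a third of a second at 2/sec.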

However getting 200ms time isn't hard either, as long as the hardware
is reasonable and the index size isn't huge.

In our experience, using more, cheaper boxes is the way to go. For
web crawl data, I would probably go with two 10M page indexes per
box, where the Lucene index goes on a smaller, faster drive and the
page contents go on a bigger, slower drive. So then you'd have two
faster drives and two slower drives per box, and use a dual CPU with
dual cores. And 4GB of RAM, so each JVM gets 1.5GB with some
breathing room for the OS.

Which means you'd need about five of these servers for 100M
pages...unless you want replication for reliability, which means 10
servers.
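
The arithmetic behind those numbers can be sketched as below. The
function names are hypothetical, and running one JVM per index (so two
JVMs per box) is my reading of the description above, not something it
states outright.

```python
import math


def servers_needed(total_pages, pages_per_index, indexes_per_box, replicas=1):
    """Number of boxes needed to hold all pages, times the replica count."""
    pages_per_box = pages_per_index * indexes_per_box
    return math.ceil(total_pages / pages_per_box) * replicas


def os_headroom_gb(ram_gb, jvms_per_box, heap_per_jvm_gb):
    """RAM left for the OS after the JVM heaps are allocated (assumes one JVM per index)."""
    return ram_gb - jvms_per_box * heap_per_jvm_gb


if __name__ == "__main__":
    print(servers_needed(100_000_000, 10_000_000, 2))              # 5
    print(servers_needed(100_000_000, 10_000_000, 2, replicas=2))  # 10
    print(os_headroom_gb(4, 2, 1.5))                               # 1.0
```

Changing pages_per_index or replicas shows how quickly the box count
moves if the index per box turns out to be too large for the target
search time.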

-- Ken

>No, not familiar with that yet - can you send out any URLs?
>
>My question is really whether you're better off with one or two big
>boxes or a series of small boxes - also looking for anyone who has 100
>million pages in their index and a description of their hardware as a
>reference point...
>
>Thanks!
>
>Paul
>
>
>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of VK
>Sent: Wednesday, November 28, 2007 9:53 PM
>To: [email protected]
>Subject: Re: Hardware Planning
>
>Have you considered EC2 + S3?
>
>Also Rightscale has some interesting solutions, which I am currently
>evaluating.
>
>On Nov 28, 2007 9:38 PM, Paul Stewart <[EMAIL PROTECTED]> wrote:
>
>>  Hi folks...
>>
>>  I have read the archives and am looking for input specific to my
>>  estimated requirements:
>>
>>  Want to index about 100 million public webpages.  Space and
>>  bandwidth are not a problem - coming up with the right hardware and
>>  keeping the cost down is my goal.
>>
>>  I would estimate only 1-2 searches per second at least during the
>>  first hardware phase.
>>
>>  With that in mind I'm trying to figure out whether to use a couple
>>  of larger Dell servers or a bunch of small single CPU, 1 Gig RAM,
>>  160 GB hard drive type of machines....
>>
>>  Can anyone share what they are using for hardware for about 100
>>  million webpages and their search result times etc??  Real-world is
>>  important to me and being able to scale is important....
>>
>>  Thanks,
>>
>>  Paul
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>----------------------------------------------------------------------------
>>
>>  "The information transmitted is intended only for the person or
>>  entity to which it is addressed and contains confidential and/or
>>  privileged material. If you received this in error, please contact
>>  the sender immediately and then destroy this transmission, including
>>  all attachments, without copying, distributing or disclosing same.
>>  Thank you."
>>
>
>
>
>


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"



