Hey Otis,
 
I will first disclose that the OS I'm using for my Nutch implementation is 
FreeBSD 7 (amd64), which may differ from a standard 64-bit Linux distribution. 
The JDK, however, is the standard Sun 1.5.0-14 64-bit package.
 
I find that the JVM does not treat Nutch as something that's truly 
multithreaded. Whichever task you ask it to do, be it serving results, 
fetching, injecting, updating, etc., it will always peg one core and not use 
anything else (sometimes it will share processing with another core, but this 
is just the garbage collection thread inside the JVM).
 
Having smaller indexes (15-20M pages) on multiple Nutch instances (with 4GB or 
so of RAM each) doesn't fix this limitation, but it does cheat around it: each 
instance runs as its own independent JVM, so the OS scheduler (in my case 
FreeBSD's ULE) will run each instance on whichever core has the lowest 
utilization.
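
As a rough sketch of the arithmetic behind this kind of setup: the instance 
count is capped both by the number of cores and by how many JVM heaps fit in 
RAM. The class and method names below are made up for illustration and are not 
part of Nutch; the numbers are the ones from this thread (8 cores, 32GB of RAM, 
4GB heaps, 20M pages per index). It also assumes all RAM can go to heaps, which 
ignores what the OS and disk cache need, so treat the result as an upper bound.

```java
// Back-of-the-envelope sizing for a multi-instance search setup.
// All names here are hypothetical; nothing below is part of Nutch itself.
final class InstancePlan {

    /** Number of instances, capped by core count and by how many heaps fit in RAM. */
    static int instanceCount(int cores, int ramGb, int heapGbPerInstance) {
        if (cores <= 0 || ramGb <= 0 || heapGbPerInstance <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        // Integer division: a partial heap does not count as an instance.
        return Math.min(cores, ramGb / heapGbPerInstance);
    }

    /** Total pages served across all instances, in millions. */
    static long totalPagesMillions(int instances, int pagesPerInstanceMillions) {
        return (long) instances * pagesPerInstanceMillions;
    }

    /** Standard JVM maximum-heap flag for one instance, e.g. "-Xmx4g". */
    static String heapFlag(int heapGbPerInstance) {
        return "-Xmx" + heapGbPerInstance + "g";
    }

    public static void main(String[] args) {
        int instances = instanceCount(8, 32, 4);
        System.out.println(instances + " instances, " + heapFlag(4) + " each, "
                + totalPagesMillions(instances, 20) + "M pages total");
    }
}
```

With the numbers from this thread the cap is 8 instances, which lines up with 
the 160M pages mentioned in the earlier message. Halving the RAM to 16GB would 
drop it to 4 instances even with 8 cores available.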
 
When you think about it, this type of setup scales very well horizontally, much 
like Nutch/Hadoop itself. I find that creating one huge index on the same 
machine and giving it everything it has in terms of resources has diminishing 
returns and, as my example points out, never uses it all anyway.
 
One negative about this setup, though, is detailed in NUTCH-92. This issue 
alone kills any attempt to scale your search engine for "mainstream" commercial 
success (e.g. Google).



----- Original Message ----
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, June 6, 2008 12:20:41 PM
Subject: Re: Hardware Specifications

Dan, you left out one important "bit" - this is a 64-bit machine?

Sean, out of curiosity... is this really better than running a single JVM 
instance, single Nutch instance, on a multi-core 64-bit machine with 32GB of 
RAM and letting the OS switch between cores?


As for fetching/indexing/searching - you probably don't want to do this on the 
same set of machines.  Use a set of machines for fetching/indexing, and a set 
of machines for serving search requests.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, June 5, 2008 3:45:41 PM
> Subject: Re: Hardware Specifications
> 
> Another idea is to set up 8 separate Nutch instances on the same server, each
> with its own 20M index.
>  
> The idea behind this is that one core per application will be used, although
> it's not pegged, and the RAM is used in ~4GB chunks (JVM setting) for each
> instance.
>  
> This would be used for serving results only, though. You would have to disable
> part or all of this when in fetching mode, but it would give you 160M pages
> and still very good speeds (about 4-5 per second or more, as other factors
> come into play). Keep in mind we use 8 hard drives, each associated with its
> own instance on the server, but as long as the RAID FC setup you have is very
> fast the results should be comparable (maybe even faster).
> 
> 
> ----- Original Message ----
> From: Dennis Kubes 
> To: [email protected]
> Sent: Thursday, June 5, 2008 2:38:04 PM
> Subject: Re: Hardware Specifications
> 
> In memory index 15M.  On disk index, slower but still doable where 
> response time isn't critical, ~350M pages maybe more.
> 
> Dennis
> 
> Dan Segel wrote:
> > We have a server that has 30TB of hard drive space connected through fiber,
> > 2 quad-core 2.5GHz CPUs, and 32GB of RAM.  If handling 5 searches per second,
> > how many million indexed pages do you think we can achieve?
> > 
