Re: Hardware Specifications

ogjunk-nutch Wed, 11 Jun 2008 21:18:45 -0700

Hm, hm.

I can't speak for Nutch's search (don't have it running at the moment), but I 
am looking at a cluster that is running a fetch job and a generate job 
concurrently and I see both cores on the dual-core server being utilized about 
equally.
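
For anyone who wants to check core usage on their own box, here is a minimal sketch. It is not Nutch or Hadoop code: the class name CoreUsageSketch and the burn() workload are made up for illustration. It only shows that a single JVM will spread CPU-bound work across cores when the application actually runs several threads, which is what a top/vmstat window should confirm while it runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class, not part of Nutch or Hadoop.
public class CoreUsageSketch {

    // CPU-bound busy work: sums (i * i) % 7 for i in [0, iterations).
    static long burn(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += (i * i) % 7;
        }
        return acc;
    }

    // Submits one burn() task per worker thread and collects every result.
    static List<Long> runOnPool(int threads, final long iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(new Callable<Long>() {
                    public Long call() {
                        return burn(iterations);
                    }
                }));
            }
            List<Long> results = new ArrayList<Long>();
            for (Future<Long> f : futures) {
                try {
                    results.add(f.get()); // blocks until that task finishes
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Number of cores the JVM reports as available to it.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Cores visible to this JVM: " + cores);
        long start = System.currentTimeMillis();
        runOnPool(cores, 200000000L);
        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
    }
}
```

Running it with a pool size of 1 and then with the reported core count, while watching per-core utilization, gives a rough idea of whether the OS and JVM on a given machine schedule threads across cores.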



Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, June 7, 2008 3:52:33 AM
> Subject: Re: Hardware Specifications
> 
> Hey Otis,
>  
> I will first disclose that the OS I'm using for my Nutch implementation is 
> FreeBSD 7 (amd64), which may differ from a standard 64-bit Linux distribution. 
> The JDK, however, is the standard Sun 1.5.0-14 64-bit package.
>  
> I find that the JVM does not treat Nutch as something that's truly 
> multithreaded. Whichever task you ask it to do, be it serve results, fetch, 
> inject, update, etc., it will always peg one core and not use anything else 
> (sometimes it will share processing on another core, but this is just the 
> garbage collection thread inside the JVM).
>  
> Having smaller indexes (15-20M pages) on multiple Nutch instances (with 4GB 
> or so of RAM each) doesn't fix this limitation, but it does cheat in that 
> each instance runs as its own independent JVM, and as such the OS scheduler 
> (in my case FreeBSD's ULE) will execute each instance's operations on the 
> core with the lowest utilization.
>  
> When you think about it, this type of setup scales very well horizontally, 
> much like Nutch/Hadoop itself. I find creating one huge index on the same 
> machine and giving it everything it has in terms of resources has diminishing 
> returns, and as my example points out, never uses it all anyway.
>  
> One negative about this setup, though, is detailed in NUTCH-92. This issue 
> alone kills any attempt to scale your search engine for "mainstream" 
> commercial success (e.g. Google).
> 
> 
> 
> ----- Original Message ----
> From: "[EMAIL PROTECTED]" 
> To: [email protected]
> Sent: Friday, June 6, 2008 12:20:41 PM
> Subject: Re: Hardware Specifications
> 
> Dan, you left out one important "bit" - this is a 64-bit machine?
> 
> Sean, out of curiosity... is this really better than running a single JVM 
> instance, single Nutch instance, on a multi-core 64-bit machine with 32GB of 
> RAM and letting the OS switch between cores?
> 
> 
> As for fetching/indexing/searching - you probably don't want to do this on 
> the same set of machines.  Use a set of machines for fetching/indexing, and a 
> set of machines for serving search requests.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Sean Dean 
> > To: [email protected]
> > Sent: Thursday, June 5, 2008 3:45:41 PM
> > Subject: Re: Hardware Specifications
> > 
> > Another idea is to set up 8 separate Nutch instances on the same server, 
> > each with its own 20M index.
> >  
> > The idea behind this is that one core per application will be used, 
> > although it's not pegged, and the RAM is used in ~4GB chunks (JVM setting) 
> > for each instance.
> >  
> > This would be used for serving results only, though; you would have to 
> > disable part or all of this when in fetching mode, but it would give you 
> > 160M pages and still very good speeds (about 4-5 per second or more, as 
> > other factors come into play). Keep in mind we use 8 hard drives, each 
> > associated with its own instance on the server, but as long as the RAID FC 
> > setup you have is very fast the results should be comparable (maybe even 
> > faster).
> > 
> > 
> > ----- Original Message ----
> > From: Dennis Kubes 
> > To: [email protected]
> > Sent: Thursday, June 5, 2008 2:38:04 PM
> > Subject: Re: Hardware Specifications
> > 
> > In memory index 15M.  On disk index, slower but still doable where 
> > response time isn't critical, ~350M pages maybe more.
> > 
> > Dennis
> > 
> > Dan Segel wrote:
> > > > We have a server that has 30TB of hard drive space connected through
> > > > fiber, 2 quad-core 2.5GHz CPUs, and 32GB of RAM.  If fetching 5 searches
> > > > per second, how many million indexed pages do you think we can achieve?
> > >
