Re: Hardware Specifications

Otis Gospodnetic Thu, 12 Jun 2008 22:54:37 -0700

Not sure any more.  All this time I was thinking about search, not about 
MapReduce jobs.  Two different beasts.  I think it just so happens that Hadoop 
launches separate JVMs for separate jobs, but the reason for that is not to 
maximize the use of the resources.  (But what is it?  Job isolation?)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, June 13, 2008 12:52:53 AM
> Subject: Re: Hardware Specifications
> 
> So in the simplest of contexts, you're sort of agreeing with me. Running
> multiple Nutch processes on a multi-core processor is more efficient than
> running one single process on heavily scaled hardware.
> 
> Am I correct with this statement?
> 
> 
> ----- Original Message ----
> From: Otis Gospodnetic 
> To: [email protected]
> Sent: Friday, June 13, 2008 12:16:38 AM
> Subject: Re: Hardware Specifications
> 
> I'm not sure -- I try to avoid running a single Nutch job at a time, as I find
> overlapping is more efficient.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: Sean Dean 
> > To: [email protected]
> > Sent: Thursday, June 12, 2008 12:37:19 PM
> > Subject: Re: Hardware Specifications
> > 
> > I see.
> >  
> > What happens with the utilization when only one job is running? Does it
> > stay about equal at a lower overall percentage, or does it move
> > predominantly to one core?
> > 
> > 
> > 
> > ----- Original Message ----
> > From: "[EMAIL PROTECTED]" 
> > To: [email protected]
> > Sent: Thursday, June 12, 2008 12:17:10 AM
> > Subject: Re: Hardware Specifications
> > 
> > Hm, hm.
> > 
> > I can't speak for Nutch's search (don't have it running at the moment), but
> > I am looking at a cluster that is running a fetch job and a generate job
> > concurrently, and I see both cores on the dual-core server being utilized
> > about equally.
> > 
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: Sean Dean 
> > > To: [email protected]
> > > Sent: Saturday, June 7, 2008 3:52:33 AM
> > > Subject: Re: Hardware Specifications
> > > 
> > > Hey Otis,
> > >  
> > > I will first disclose that the OS I'm using for my Nutch implementation is
> > > FreeBSD 7 (amd64), which may differ from a standard 64-bit Linux
> > > distribution. The JDK, however, is your standard Sun 1.5.0-14 64-bit
> > > package.
> > >  
> > > I find that the JVM does not treat Nutch as something that's truly
> > > multithreaded. Whichever task you ask it to do, be it serve results,
> > > fetch, inject, update, etc., it will always peg one core and not use
> > > anything else (sometimes it will share processing on another core, but
> > > this is just the garbage collection thread inside the JVM).
> > >  
> > > Having smaller indexes (15-20M) on multiple Nutch instances (with 4GB or
> > > so of RAM) doesn't fix this limitation, but it does cheat in that each
> > > instance runs as its own independent JVM, and as such the OS will execute
> > > operations on the core which has the lowest utilization via the scheduler
> > > (in my case FreeBSD's ULE) for each instance.
> > >  
> > > When you think about it, this type of setup scales very well horizontally,
> > > much like Nutch/Hadoop itself. I find creating one huge index on the same
> > > machine and giving it everything it has in terms of resources has
> > > diminishing returns and, as my example points out, never uses it all
> > > anyway.
> > >  
> > > One negative about this setup, though, is detailed in NUTCH-92. This issue
> > > alone kills any attempt to scale your search engine for "mainstream"
> > > commercial success (e.g. Google).
> > > 
> > > 
> > > 
> > > ----- Original Message ----
> > > From: "[EMAIL PROTECTED]" 
> > > To: [email protected]
> > > Sent: Friday, June 6, 2008 12:20:41 PM
> > > Subject: Re: Hardware Specifications
> > > 
> > > Dan, you left out one important "bit" - this is a 64-bit machine?
> > > 
> > > Sean, out of curiosity... is this really better than running a single JVM
> > > instance, single Nutch instance, on a multi-core 64-bit machine with 32GB
> > > of RAM and letting the OS switch between cores?
> > > 
> > > 
> > > As for fetching/indexing/searching - you probably don't want to do this
> > > on the same set of machines.  Use a set of machines for fetching/indexing,
> > > and a set of machines for serving search requests.
> > > 
> > > Otis
> > > --
> > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > > 
> > > 
> > > ----- Original Message ----
> > > > From: Sean Dean 
> > > > To: [email protected]
> > > > Sent: Thursday, June 5, 2008 3:45:41 PM
> > > > Subject: Re: Hardware Specifications
> > > > 
> > > > Another idea is to set up 8 separate Nutch instances on the same server,
> > > > each with its own 20M index.
> > > >  
> > > > The idea behind this is that one core per application will be used,
> > > > although it's not pegged, and the RAM is used in ~4GB chunks (JVM
> > > > setting) for each instance.
> > > >  
> > > > This would be used for serving results only, though; you would have to
> > > > disable part or all of this when in fetching mode, but it would give you
> > > > 160M pages and still very good speeds (about 4-5 per second or more as
> > > > other factors come into play). Keep in mind we use 8 hard drives, each
> > > > associated with its own instance on the server, but as long as the RAID
> > > > FC setup you have is very fast the results should be comparable (maybe
> > > > even faster).
> > > > 
> > > > 
> > > > ----- Original Message ----
> > > > From: Dennis Kubes 
> > > > To: [email protected]
> > > > Sent: Thursday, June 5, 2008 2:38:04 PM
> > > > Subject: Re: Hardware Specifications
> > > > 
> > > > In memory index 15M.  On disk index, slower but still doable where 
> > > > response time isn't critical, ~350M pages maybe more.
> > > > 
> > > > Dennis
> > > > 
> > > > Dan Segel wrote:
> > > > > > We have a server that has 30TB of hard drive space connected through
> > > > > > fiber, 2 quad core 2.5GHz, and 32GB of RAM.  If fetching 5 searches
> > > > > > per second how many million indexed pages do you think we can achieve?
> > > > >
