Hm, hm. I can't speak for Nutch's search (don't have it running at the moment), but I am looking at a cluster that is running a fetch job and a generate job concurrently and I see both cores on the dual-core server being utilized about equally.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, June 7, 2008 3:52:33 AM
> Subject: Re: Hardware Specifications
>
> Hey Otis,
>
> I will first disclose that the OS I'm using for my Nutch implementation
> is FreeBSD 7 (amd64) and may differ from a standard 64-bit Linux
> distribution. The JDK however is your standard SUN 1.5.0-14 64-bit
> package.
>
> I find that the JVM does not treat Nutch as something that's truly
> multithreaded. Whichever task you ask it to do, be it serve results,
> fetch, inject, update, etc., it will always peg one core and not use
> anything else (sometimes it will share processing on another core, but
> this is just the garbage collection thread inside the JVM).
>
> Having smaller indexes (15-20M) on multiple Nutch instances (with 4GB
> or so of RAM) doesn't fix this limitation, but it does cheat in that
> each instance runs as its own independent JVM, and as such the OS will
> execute operations on the core which has the lowest utilization via
> the scheduler (in my case FreeBSD's ULE) for each instance.
>
> When you think about it, this type of setup scales very well
> horizontally, much like Nutch/Hadoop itself. I find creating one huge
> index on the same machine and giving it everything it has in terms of
> resources has diminishing returns, and as my example points out never
> uses it all anyway.
>
> One negative about this setup though is detailed in NUTCH-92. This
> issue alone kills any attempt to scale your search engine for "main
> stream" commercial success (e.g. Google).
>
> ----- Original Message ----
> From: "[EMAIL PROTECTED]"
> To: [email protected]
> Sent: Friday, June 6, 2008 12:20:41 PM
> Subject: Re: Hardware Specifications
>
> Dan, you left out one important "bit" - this is a 64-bit machine?
>
> Sean, out of curiosity... is this really better than running a single
> JVM instance, single Nutch instance, on a multi-core 64-bit machine
> with 32GB of RAM and letting the OS switch between cores?
>
> As for fetching/indexing/searching - you probably don't want to do
> this on the same set of machines. Use a set of machines for
> fetching/indexing, and a set of machines for serving search requests.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Sean Dean
> > To: [email protected]
> > Sent: Thursday, June 5, 2008 3:45:41 PM
> > Subject: Re: Hardware Specifications
> >
> > Another idea is to set up 8 separate Nutch instances on the same
> > server, each with its own 20M index.
> >
> > The idea behind this is that one core per application will be used,
> > although it's not pegged, and the RAM is used in ~4GB chunks (JVM
> > setting) for each instance.
> >
> > This would be used for serving results only though; you would have
> > to disable part or all of this when in fetching mode, but it would
> > give you 160M pages and still very good speeds (about 4-5 per second
> > or more as other factors come into play). Keep in mind we use 8 hard
> > drives, each associated with its own instance on the server, but as
> > long as the RAID FC setup you have is very fast the results should
> > be comparable (maybe even faster).
> >
> > ----- Original Message ----
> > From: Dennis Kubes
> > To: [email protected]
> > Sent: Thursday, June 5, 2008 2:38:04 PM
> > Subject: Re: Hardware Specifications
> >
> > In memory index 15M. On disk index, slower but still doable where
> > response time isn't critical, ~350M pages maybe more.
> >
> > Dennis
> >
> > Dan Segel wrote:
> > > We have a server that has 30TB of hard drive space connected
> > > through fiber, 2 quad core 2.5GHz, and 32GB of RAM. If fetching 5
> > > searches per second how many million indexed pages do you think we
> > > can achieve?
> > >
