Hey Otis,

I will first disclose that the OS I'm using for my Nutch implementation is FreeBSD 7 (amd64), so my results may differ from a standard 64-bit Linux distribution. The JDK, however, is your standard Sun 1.5.0-14 64-bit package.

I find that the JVM does not treat Nutch as something that's truly multithreaded. Whichever task you ask it to do, be it serve results, fetch, inject, update, etc., it will always peg one core and not use anything else (sometimes it will share processing with another core, but this is just the garbage collection thread inside the JVM).

Having smaller indexes (15-20M pages) on multiple Nutch instances (with 4GB or so of RAM each) doesn't fix this limitation, but it does cheat: each instance runs as its own independent JVM, so the OS scheduler (in my case FreeBSD's ULE) runs each instance on whichever core has the lowest utilization. When you think about it, this type of setup scales very well horizontally, much like Nutch/Hadoop itself. I find that creating one huge index on the same machine and giving it everything it has in terms of resources has diminishing returns and, as my example points out, never uses it all anyway.

One negative about this setup, though, is detailed in NUTCH-92. This issue alone kills any attempt to scale your search engine for "mainstream" commercial success (e.g. Google).
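To put rough numbers on the multi-instance layout, here is a minimal sizing sketch in Java. The class and method names are made up for illustration and are not part of Nutch; the inputs (8 instances, a 20M-page index and ~4GB of heap each, 8 cores, 32GB of RAM) are the figures quoted in this thread. The fit check is an assumption of mine and deliberately naive: it ignores whatever the OS and the disk cache need on top of the JVM heaps.

```java
// Sizing sketch for running several independent Nutch JVMs on one server.
// Hypothetical helper, not a Nutch API; the numbers come from this thread.
final class InstancePlan {
    final int instances;          // independent Nutch JVMs on the box
    final long pagesPerInstance;  // index size per instance, in pages
    final int heapGbPerInstance;  // JVM heap per instance, in GB

    InstancePlan(int instances, long pagesPerInstance, int heapGbPerInstance) {
        this.instances = instances;
        this.pagesPerInstance = pagesPerInstance;
        this.heapGbPerInstance = heapGbPerInstance;
    }

    // Total pages served across all instances.
    long totalPages() {
        return instances * pagesPerInstance;
    }

    // Total heap claimed by all JVMs together.
    int totalHeapGb() {
        return instances * heapGbPerInstance;
    }

    // Naive check: at most one instance per core, and the heaps fit in RAM.
    // Assumption: OS and page-cache overhead is not accounted for.
    boolean fits(int cores, int ramGb) {
        return instances <= cores && totalHeapGb() <= ramGb;
    }

    public static void main(String[] args) {
        InstancePlan plan = new InstancePlan(8, 20000000L, 4);
        System.out.println("total pages:   " + plan.totalPages());
        System.out.println("total heap GB: " + plan.totalHeapGb());
        System.out.println("fits 8 cores / 32GB: " + plan.fits(8, 32));
    }
}
```

Changing the instance count or the per-instance heap shows how quickly the heaps alone use up the 32GB on this server.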
----- Original Message ----
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, June 6, 2008 12:20:41 PM
Subject: Re: Hardware Specifications

Dan, you left out one important "bit" - this is a 64-bit machine?

Sean, out of curiosity... is this really better than running a single JVM instance, with a single Nutch instance, on a multi-core 64-bit machine with 32GB of RAM, and letting the OS switch between cores?

As for fetching/indexing/searching - you probably don't want to do this on the same set of machines. Use a set of machines for fetching/indexing, and a set of machines for serving search requests.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Sean Dean <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, June 5, 2008 3:45:41 PM
> Subject: Re: Hardware Specifications
>
> Another idea is to set up 8 separate Nutch instances on the same server, each
> with its own 20M index.
>
> The idea behind this is that one core per application will be used, although
> it's not pegged, and the RAM is used in ~4GB chunks (JVM setting) for each
> instance.
>
> This would be used for serving results only though; you would have to disable
> part or all of this when in fetching mode, but it would give you 160M pages
> and still very good speeds (about 4-5 per second or more as other factors
> come into play). Keep in mind we use 8 hard drives, each associated with its
> own instance on the server, but as long as the RAID FC setup you have is very
> fast the results should be comparable (maybe even faster).
>
>
> ----- Original Message ----
> From: Dennis Kubes
> To: [email protected]
> Sent: Thursday, June 5, 2008 2:38:04 PM
> Subject: Re: Hardware Specifications
>
> In memory index 15M. On disk index, slower but still doable where
> response time isn't critical, ~350M pages maybe more.
>
> Dennis
>
> Dan Segel wrote:
> > We have a server that has 30TB of hard drive space connected through fiber,
> > 2 quad core 2.5GHz, and 32GB of RAM. If fetching 5 searches per second how
> > many million indexed pages do you think we can achieve?
