In many ways I learned from trial and error, and had I seen someone mention best practices, I probably would have been better informed as to what would and wouldn't work.
Of course Nutch is very much a moving target, and the hardware of today is drastically superior to that of even just a year ago in performance, cost & reliability. For me, load distribution across many nodes and a storage network, rather than one big server, is the only way -- both in price & performance. The learning curve for me was the same either way. :)

-byron

-----Original Message-----
From: "Chirag Chaman" <[EMAIL PROTECTED]>
To: <[email protected]>
Date: Fri, 15 Apr 2005 16:15:02 -0400
Subject: RE: [Nutch-general] RE: Nutch - new public server

> Byron,
>
> While I agree with your analysis/feedback, I think it can be a little
> daunting for a first-timer (or someone who has not got their hands
> dirty with Nutch).
>
> Yes, Resin is more stable -- but Tomcat works out of the box, and
> Resin issues can be hard to fix given that not a whole lot of people
> use the Resin/Nutch combo.
>
> Distributed DB/NDFS - In theory this is definitely the way to go. In
> practice, this again can be daunting for someone without a lot of
> experience -- when something goes wrong, chasing down the problem can
> be time consuming. Now, I'm sure there are folks like you for whom it
> was not very difficult -- but in general, my feeling is that right now
> the distributed WebDB requires very good (experienced) troubleshooting
> skills.
>
> That being said, I think distribution is the only way to go for a
> large implementation. While our (centralized) approach is scalable and
> SUPER EASY to manage, it comes at a price ($$$) - requiring expensive
> H/W for the DB servers. We use a customized WebDB in conjunction with
> commercial DB software to break out the fetch/crawl processes. This
> parallelizes the processes and makes computation a lot faster. In
> theory, this is a lot like the MapReduce discussion that has been
> going on, except that most of this we do external to Nutch, allowing
> us to use the existing code with little modification.
> Given that we do very frequent fetches, this works for us -- for
> someone with a more batched, less frequent process, another strategy
> would work better.
>
> I wholeheartedly agree with the XML approach to displaying results --
> I saw a lot of talk about this over the last few days and hope it is
> built soon.
>
> CC-
>
> -----Original Message-----
> From: Byron Miller [mailto:[EMAIL PROTECTED]]
> Sent: Friday, April 15, 2005 11:36 AM
> To: [email protected]
> Subject: RE: [Nutch-general] RE: Nutch - new public server
>
> To add from my experiences:
>
> I've preferred Resin (stability & performance).
>
> I always go for more RAM rather than more servers. It's cheaper in the
> long run when it comes to man hours and service, as well as MTBF for
> your hardware.
>
> Use Squid to proxy/load-balance your Java servers. This helped
> alleviate much of my traffic. Smart Squid policies & configurations
> can keep queries from even hitting your servers if they're THAT common
> (which seems to happen more often than not).
>
> I'm leaning much more toward using the distributed WebDB and multiple
> NDFS servers for storage. I'm done with piling terabytes on a single
> server. Once you have millions of pages in your db and you're trying
> to keep the fetch -> import -> analyze -> fetch cycle up, you will see
> what I mean. :)
>
> Try converting search.jsp to XML, and process that through XSLT or
> another process so your search processes can complete quickly without
> any extra page rendering you may have going on. (This allows you to
> incorporate other results, insert sponsored feeds, and do all sorts of
> nifty stuff as well.)
>
> -----Original Message-----
> From: "Chirag Chaman" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Date: Fri, 15 Apr 2005 10:52:37 -0400
> Subject: RE: [Nutch-general] RE: Nutch - new public server
>
> > >1. "Souped-up" DB server - Dual CPU, 4 GB RAM (min), RAID 5 or 10,
> > >1-2 NICs
> >
> > This is the 'fetcher' server?
> > This is your fetch/crawler/indexer -- create the final segments
> > here, then move them to the search server. That way, if a search
> > server goes down, simply move the segment to another server.
> >
> > >2. Basic Search Servers - Single/Dual CPU, maximum RAM, single
> > >IDE/SATA drive (or 2 for redundancy)
> >
> > These are the 'fetched' segment backup and search servers?
> > If I have 10 million pages per server, is this the right figure:
> > 2 KB * 10 million = 20 GB RAM? Or is 10 GB enough, with more added
> > later if needed?
> >
> > Actually, you'll want 20 GB of RAM if you're trying to displace MSN
> > as the fastest search engine. Believe it or not, Lucene is EXTREMELY
> > fast even when reading from disk (who's the genius who wrote that
> > software?). I would keep about 4-8 MM pages per server and give
> > about 1 GB per million. Let the Linux file caching system do its
> > magic. After the first 20-30 searches, things should be pretty fast.
> > Take a look at filangy.com - search is pretty fast and we're hitting
> > the disk. The only drawback is that from disk we see things starting
> > to slow down if more than 5-6 searches happen simultaneously. That's
> > 5-6 per second -- and we usually improve by adding another server.
> > Given that 1 GB sticks are much cheaper than 2 GB sticks, you'll
> > find adding another cheap server is cheaper than adding more RAM.
> > And the 2 GB sticks are supported only in more high-end servers --
> > so cheap hardware cannot be used anymore.
> >
> > >3. Basic Web Servers - Single/Dual CPU, medium RAM
> >
> > In these boxes I will put 1-2 GB RAM.
> > I would like to put frontend Apache2 and mod_jk2 here; is this a
> > bottleneck, or can I tune some things this way: caching of static
> > images, web pages, etc.? Or is the better way Tomcat directly on
> > the web?
> >
> > Go with Tomcat directly for now -- you don't want the search pages
> > to take the Apache/mod_jk2 hit every time.
> > Later you can split the static pages out into a separate site that
> > can be on Apache. For loading images, make a separate URL,
> > image.domain.com, and load those from there.
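[The sizing rule of thumb in the thread -- roughly 2 KB of hot data per page, i.e. about 1 GB of RAM per million pages, with 4-8 MM pages per search server -- can be sketched as a quick back-of-the-envelope calculation. The per-page footprint and the 6 MM midpoint below are the thread's own estimates, not measured figures:]

```python
# Back-of-the-envelope RAM sizing for a Nutch/Lucene search server,
# using the thread's rule of thumb: ~2 KB of cacheable data per page,
# i.e. ~1 GB of RAM per million pages.

KB_PER_PAGE = 2  # thread's estimate of per-page footprint (assumption)

def ram_gb_needed(pages: int, kb_per_page: int = KB_PER_PAGE) -> float:
    """RAM (in GB) to keep the whole index hot in the Linux page cache."""
    return pages * kb_per_page / 1_000_000  # KB -> GB, decimal units

def servers_needed(total_pages: int, pages_per_server: int = 6_000_000) -> int:
    """Servers required at the thread's suggested 4-8 MM pages/server
    (midpoint of 6 MM assumed here)."""
    return -(-total_pages // pages_per_server)  # ceiling division

print(ram_gb_needed(10_000_000))   # 10 MM pages -> 20.0 GB, as in the thread
print(servers_needed(60_000_000))  # e.g. 60 MM pages -> 10 servers
```

[As the thread notes, past ~4-8 MM pages it is usually cheaper to add another commodity box than to add RAM, since large DIMMs pushed you into high-end hardware at the time.]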

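[Byron's suggestion of putting Squid in front of the Java search servers as a caching reverse proxy might look roughly like the fragment below. This is a minimal sketch using modern (Squid 2.6+) accelerator-mode directives rather than the httpd_accel_* directives of the Squid 2.5 era; the hostnames, ports, and cache lifetimes are placeholder assumptions, not settings from the thread:]

# Squid as a caching reverse proxy ("accelerator") in front of a
# Tomcat/Resin search server. Hostname and ports are placeholders.
http_port 80 accel defaultsite=search.example.com

# Forward cache misses to the Java servlet container on port 8080.
cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=search
cache_peer_access search allow all

# Cache result pages briefly, so the most common queries never reach
# the Java tier. Format: min-age(min) percent max-age(min).
refresh_pattern ^/search 5 20% 10

[Note that stock Squid configs of that era refused to cache URLs containing "?" (the default "cache deny QUERY" rule), so caching dynamic search results would also require relaxing that default.]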