In many ways I learned from trial and error, and had I seen someone mention best practices, I probably would have been better informed as to what would and wouldn't work.
Of course Nutch is very much a moving target, and the hardware of today is drastically superior to that of even just a year ago in performance, cost & reliability. For me, load distribution across many nodes and a storage network, rather than one big server, is the only way -- both in price & performance. The learning curve for me was the same either way. :)

-byron

-----Original Message-----
From: "Chirag Chaman" <[EMAIL PROTECTED]>
To: <[email protected]>
Date: Fri, 15 Apr 2005 16:15:02 -0400
Subject: RE: [Nutch-general] RE: Nutch - new public server

> Byron,
>
> While I agree with your analysis/feedback, I think it can be a little
> daunting for a first-timer (or someone who has not got their hands
> dirty with Nutch).
>
> Yes, Resin is more stable -- but Tomcat works out of the box, and
> Resin issues can be hard to fix given that not a whole lot of people
> use the Resin/Nutch combo.
>
> Distributed DB/NDFS - In theory this is definitely the way to go. In
> practice, this again can be daunting for someone without a lot of
> experience -- when something goes wrong, chasing down the problem can
> be time consuming. Now, I'm sure there are folks like you for whom it
> was not very difficult -- but in general, my feeling is that right now
> the distributed WebDB requires very good (experienced) troubleshooting
> skills.
>
> That being said, I think distribution is the only way to go for a
> large implementation. While our (centralized) approach is scalable and
> SUPER EASY to manage, it comes at a price ($$$) - requiring expensive
> H/W for the DB servers. We use a customized WebDB in conjunction with
> commercial DB software to break out the fetch/crawl processes. This
> parallelizes the processes and makes computation a lot faster. In
> theory, this is a lot like the MapReduce discussion that has been
> going on, except that most of this we do external to Nutch, allowing
> us to use the existing code with little modification.
> Given that we do very frequent fetches, this works for us -- for
> someone with a more batched, less frequent process, another strategy
> would work better.
>
> I wholeheartedly agree with the XML approach to displaying results --
> I saw a lot of talk about this over the last few days and hope it is
> built soon.
>
> CC-
>
> -----Original Message-----
> From: Byron Miller [mailto:[EMAIL PROTECTED]]
> Sent: Friday, April 15, 2005 11:36 AM
> To: [email protected]
> Subject: RE: [Nutch-general] RE: Nutch - new public server
>
> To add from my experiences:
>
> I've preferred Resin (stability & performance).
>
> I always go for more RAM rather than more servers. It's cheaper in the
> long run when it comes to man hours and service, as well as MTBF for
> your hardware.
>
> Use Squid to proxy/load-balance your Java servers. This helped
> alleviate much of my traffic. Smart Squid policies & configurations
> can keep queries from even hitting your servers if they're THAT common
> (which seems to happen more often than not).
>
> I'm leaning much more toward using the distributed WebDB and multiple
> NDFS servers for storage. I'm done with piling terabytes on a single
> server. Once you have millions of pages in your db and you're trying
> to keep the fetch -> import -> analyze -> fetch cycle up, you will see
> what I mean. :)
>
> Try converting search.jsp to XML, and process that through XSLT or
> another process so your search processes can complete quickly without
> any extra page rendering you may have going on. (This allows you to
> incorporate other results, insert sponsored feeds, and do all sorts of
> nifty stuff as well.)
>
> -----Original Message-----
> From: "Chirag Chaman" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Date: Fri, 15 Apr 2005 10:52:37 -0400
> Subject: RE: [Nutch-general] RE: Nutch - new public server
>
> > >1. "Souped-up" DB server - Dual CPU, 4 GB RAM (min), RAID 5 or 10,
> > >1-2 NICs
> >
> > This is the 'fetcher' server?
> > This is your fetch/crawler/indexer -- create the final segments
> > here, then move them to the search server. That way, if a search
> > server goes down, simply move the segment to another server.
> >
> > >2. Basic Search Servers - Single/Dual CPU, maximum RAM, single
> > >IDE/SATA drive (or 2 for redundancy)
> >
> > These are the 'fetched' segment backup and search servers?
> > If I have 10 million pages per server, is this the right figure:
> > 2 KB * 10 million = 20 GB RAM? Or is 10 GB enough, with more added
> > later if needed?
> >
> > Actually, you'll want 20 GB of RAM if you're trying to displace MSN
> > as the fastest search engine. Believe it or not, Lucene is EXTREMELY
> > fast even when reading from disk (who's the genius who wrote that
> > software?). I would keep about 4-8 MM pages per server and give
> > about 1 GB per million. Let the Linux file caching system do its
> > magic. After the first 20-30 searches, things should be pretty fast.
> > Take a look at filangy.com - search is pretty fast and we're hitting
> > the disk. The only drawback is that from disk we see things starting
> > to slow down if more than 5-6 searches happen simultaneously. That's
> > 5-6 per second -- and we usually improve by adding another server.
> > Given that 1 GB sticks are much cheaper than 2 GB sticks, you'll
> > find adding another cheap server is cheaper than adding more RAM.
> > And the 2 GB sticks are supported only in more high-end servers --
> > so cheap hardware cannot be used anymore.
> >
> > >3. Basic Web Servers - Single/Dual CPU, medium RAM
> >
> > In these boxes I will put 1-2 GB RAM.
> > I would like to put frontend Apache2 and mod_jk2 here; is this a
> > bottleneck, or can I tune some things this way: caching of static
> > images, web pages, etc.? Or is the better way Tomcat directly on
> > the web?
> >
> > Go with Tomcat directly for now -- you don't want the search pages
> > to take the Apache/mod_jk2 hit every time.
> > Later you can split the static pages out into a separate site that
> > can be on Apache. For loading images, make a separate URL,
> > image.domain.com, and load those from there.
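[The sizing rule of thumb in the thread -- roughly 2 KB of hot data per page, i.e. about 1 GB of RAM per million pages, with 4-8 MM pages per search server -- can be sketched as a quick back-of-the-envelope calculation. The per-page footprint and the 6 MM midpoint below are the thread's own estimates, not measured figures:]

```python
# Back-of-the-envelope RAM sizing for a Nutch/Lucene search server,
# using the thread's rule of thumb: ~2 KB of cacheable data per page,
# i.e. ~1 GB of RAM per million pages.

KB_PER_PAGE = 2  # thread's estimate of per-page footprint (assumption)

def ram_gb_needed(pages: int, kb_per_page: int = KB_PER_PAGE) -> float:
    """RAM (in GB) to keep the whole index hot in the Linux page cache."""
    return pages * kb_per_page / 1_000_000  # KB -> GB, decimal units

def servers_needed(total_pages: int, pages_per_server: int = 6_000_000) -> int:
    """Servers required at the thread's suggested 4-8 MM pages/server
    (midpoint of 6 MM assumed here)."""
    return -(-total_pages // pages_per_server)  # ceiling division

print(ram_gb_needed(10_000_000))   # 10 MM pages -> 20.0 GB, as in the thread
print(servers_needed(60_000_000))  # e.g. 60 MM pages -> 10 servers
```

[As the thread notes, past ~4-8 MM pages it is usually cheaper to add another commodity box than to add RAM, since large DIMMs pushed you into high-end hardware at the time.]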

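[Byron's suggestion of putting Squid in front of the Java search servers as a caching reverse proxy might look roughly like the fragment below. This is a minimal sketch using modern (Squid 2.6+) accelerator-mode directives rather than the httpd_accel_* directives of the Squid 2.5 era; the hostnames, ports, and cache lifetimes are placeholder assumptions, not settings from the thread:]

# Squid as a caching reverse proxy ("accelerator") in front of a
# Tomcat/Resin search server. Hostname and ports are placeholders.
http_port 80 accel defaultsite=search.example.com

# Forward cache misses to the Java servlet container on port 8080.
cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=search
cache_peer_access search allow all

# Cache result pages briefly, so the most common queries never reach
# the Java tier. Format: min-age(min) percent max-age(min).
refresh_pattern ^/search 5 20% 10

[Note that stock Squid configs of that era refused to cache URLs containing "?" (the default "cache deny QUERY" rule), so caching dynamic search results would also require relaxing that default.]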