Byron, while I agree with your analysis/feedback, I think it can be a little daunting for a first-timer (or someone who has not yet gotten their hands dirty with Nutch).
Yes, Resin is more stable -- but Tomcat works out of the box, and Resin issues can be hard to fix given that not a whole lot of people use the Resin/Nutch combo.

Distributed DB/NDFS: in theory this is definitely the way to go. In practice, it can again be daunting for someone without a lot of experience -- when something goes wrong, chasing down the problem can be time-consuming. Now, I'm sure there are folks like you for whom it was not very difficult, but in general my feeling is that right now the distributed WebDB requires very good (experienced) troubleshooting skills. That being said, I think distribution is the only way to go for a large implementation. While our (centralized) approach is scalable and SUPER EASY to manage, it comes at a price ($$$), requiring expensive H/W for the DB servers.

We use a customized WebDB in conjunction with commercial DB software to break out the fetch/crawl processes. This parallelizes the processes and makes computation a lot faster. In theory, this is a lot like the MapReduce discussion that has been going on, except that we do most of it external to Nutch, which allows us to use the existing code with little modification. Given that we do very frequent fetches, this works for us; for someone with a more batched, less frequent process, another strategy would work better.

I wholeheartedly agree with the XML approach to displaying results -- I saw a lot of talk about this over the last few days and hope it gets built soon.

CC

-----Original Message-----
From: Byron Miller [mailto:[EMAIL PROTECTED]
Sent: Friday, April 15, 2005 11:36 AM
To: [email protected]
Subject: RE: [Nutch-general] RE: Nutch - new public server

To add from my experiences:

I've preferred Resin (stability & performance).

I always go for more RAM rather than more servers. It's cheaper in the long run when it comes to man-hours and service, as well as MTBF for your hardware.

Use Squid to proxy/load-balance your Java servers. This alleviated much of my traffic. Smart Squid policies and configurations can keep queries from ever hitting your servers if they're THAT common (which seems to happen more often than not). [A sample configuration is sketched at the end of this message.]

I'm leaning much more toward using the distributed WebDB and multiple NDFS servers for storage. I'm done with piling terabytes onto a single server. Once you have millions of pages in your DB and you're trying to keep the fetch -> import -> analyze -> fetch cycle going, you will see what I mean :)

Try converting search.jsp to XML and processing that through XSLT or another process, so your searches can complete quickly without any extra page rendering you may have going on. (This also allows you to incorporate other results, insert sponsored feeds, and do all sorts of nifty stuff.)
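To make the XML/XSLT idea concrete, here is a minimal sketch using JAXP (shipped with the JDK since 1.4). The file names results.xml and results.xsl are placeholders for illustration, not actual Nutch artifacts:

// Sketch: render search results by transforming an XML payload through
// XSLT instead of building the page in search.jsp. In a real servlet
// you would cache a Templates object and call newTransformer() per
// request; this is kept to a single shot for brevity.
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RenderResults {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("results.xsl"));
        // The search code would emit this XML; here it is read from disk.
        t.transform(new StreamSource("results.xml"),
                    new StreamResult(System.out));
    }
}

Because the raw XML exists before any rendering, other result sources or sponsored feeds can be merged in before the transform runs.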
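And for the Squid point above, a minimal accelerator-mode sketch. This uses Squid 2.6-style syntax; the site name, backend IPs, and ports are assumptions, and how much actually gets served from cache depends on the cache headers your result pages emit:

# squid.conf sketch: Squid as a reverse proxy ("accelerator") in front
# of two Java app servers, balanced round-robin.
http_port 80 accel defaultsite=search.example.com

cache_peer 10.0.0.11 parent 8080 0 no-query originserver round-robin name=app1
cache_peer 10.0.0.12 parent 8080 0 no-query originserver round-robin name=app2

# Memory for hot objects; repeated queries can be answered from here
# without touching the app servers at all.
cache_mem 512 MB

acl all src 0.0.0.0/0.0.0.0
http_access allow all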
-----Original Message-----
From: "Chirag Chaman" <[EMAIL PROTECTED]>
To: <[email protected]>
Date: Fri, 15 Apr 2005 10:52:37 -0400
Subject: RE: [Nutch-general] RE: Nutch - new public server

> >1. "Souped-up" DB server - Dual CPU, 4 GB RAM (min), RAID 5 or 10,
> >1-2 NICs
>
> This is the 'fetcher' server?

This is your fetcher/crawler/indexer -- create the final segments here,
then move them to the search servers. That way, if a search server goes
down, you simply move its segment to another server.

> >2. Basic Search Servers - Single/Dual CPU, Maximum RAM, Single
> >IDE/SATA drive (or 2 for redundancy)
>
> These are the 'fetched' segment backup and search servers?
> If I have 10 million pages per server, is this right: 2 KB * 10
> million = 20 GB RAM? Or is 10 GB enough, with more added later if
> needed?

Actually, you'll want 20GB of RAM if you're trying to displace MSN as
the fastest search engine. Believe it or not, Lucene is EXTREMELY fast
even when reading from disk (who's the genius who wrote that
software?). I would keep about 4-8 million pages per server and give
about 1GB per million. Let the Linux file caching system do its magic.
After the first 20-30 searches, things should be pretty fast. Take a
look at filangy.com -- search is pretty fast and we're hitting the
disk. The only drawback is that from disk we see things starting to
slow down if more than 5-6 searches happen simultaneously. That's 5-6
per second -- and we usually improve by adding another server. Given
that 1GB sticks are much cheaper than 2GB sticks, you'll find that
adding another cheap server is cheaper than adding more RAM. And 2GB
sticks are supported only in more high-end servers -- so cheap hardware
cannot be used anymore.

> >3. Basic Web Servers - Single/Dual CPU, Medium RAM
>
> In these boxes I will put 1-2 GB of RAM.
> I would like to put Apache2 with mod_jk2 in front -- is this a
> bottleneck, or can I tune things this way (caching static images, web
> pages, etc.)? Or is it better to put the Tomcats directly on the web?

Go straight with Tomcat for now -- you don't want the search pages to
take the Apache/mod_jk2 hit every time. Later you can split the static
pages into a separate site that can be on Apache. For images, make a
separate URL, image.domain.com, and load those from there.
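For reference, a back-of-the-envelope check of the two sizing rules in the exchange above -- the asker's ~2 KB-per-page figure and the 1 GB-per-million-pages rule of thumb. The constants are the thread's numbers, not measurements:

// Sizing sketch: RAM to hold a 10M-page index entirely in memory
// (~2 KB/page, the figure from the question) versus the looser
// "1 GB of RAM per million pages" rule of thumb from the answer.
public class SearchSizing {
    public static void main(String[] args) {
        long pages = 10_000_000L;      // 10M pages on one box
        long bytesPerPage = 2_000L;    // ~2 KB per page (thread's estimate)
        double gbAllInRam = pages * bytesPerPage / 1e9;  // 20.0 GB
        double gbRuleOfThumb = pages / 1e6;              // 10.0 GB
        System.out.printf("all in RAM: ~%.0f GB, rule of thumb: ~%.0f GB%n",
                gbAllInRam, gbRuleOfThumb);
    }
}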
