One other question: what are you guys using to measure nutch performance?

Regards,
 
Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica 
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
 
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 20, 2008 9:59:24 AM
Subject: Re: distributed search servers


Here is a link to a previous posting on the hadoop list about how we go
about our setup:

http://www.mail-archive.com/[email protected]/msg10088.html

Long story short, create a tmpfs (which is a ram file system) and stick
only the indexes part (not contents or linkdb) into memory.  This will
increase performance 10x if not more.  I don't see much performance
improvement from putting the nutch site into memory (although I guess
you could), as servlets (jsp) are already in memory.  Currently we are
testing 5M page indexes on 8G 1U boxes using a PAE kernel.
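
As a rough illustration of the "indexes into memory" step, here is a minimal Python sketch. It assumes a tmpfs has already been mounted (on Linux, typically with `mount -t tmpfs`), and it only copies the index directory there after checking that it fits. The function names and any paths you pass in are hypothetical, not part of Nutch.

```python
import shutil
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under path."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def stage_index_in_ram(index_dir: Path, ram_mount: Path) -> Path:
    """Copy index_dir into ram_mount, refusing if it would not fit.

    ram_mount is assumed to be an already-mounted tmpfs; this function
    does not mount anything itself.
    """
    needed = dir_size_bytes(index_dir)
    free = shutil.disk_usage(ram_mount).free
    if needed > free:
        raise RuntimeError(f"index needs {needed} bytes, only {free} free")
    target = ram_mount / index_dir.name
    if target.exists():
        shutil.rmtree(target)  # replace a stale copy from an earlier run
    shutil.copytree(index_dir, target)
    return target
```

How the search server is then pointed at the copied directory (a symlink is one option) depends on the installation and is not shown here. A tmpfs is emptied on reboot, so the copy has to be repeated at startup.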

Dennis Kubes

Hilkiah Lavinier wrote:
> Thanks for the quick response.
> 
> Dennis, I'm not sure how to change the setting in the NutchBean;
> however, I set the variable int hitsPerSite in search.jsp instead.
> 
> On a performance note, do you recommend loading the indexes directory
> in ram (tmpfs on linux) to reduce IO and increase performance?  I
> guess it depends on how large the index is and how much ram is
> available; however, it sounds like a too-good-to-be-true method of
> squeezing out extra performance from a nutch web server.  Your
> thoughts pls.
> 
> 
> Regards,
>  
> Hilkiah G. Lavinier MEng (Hons), ACGI 
> 
> ----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 7:24:03 PM
> Subject: Re: distributed search servers
> 
> 
> 
> 
> Hilkiah Lavinier wrote:
>> Hi all,
>>
>> Have a distributed search issue I need some advice on.  The scenario
>> is that I have tomcat running off one server and two nutch search
>> servers running off two other machines (so 3 machines in total).
>> I've set up the nutch war to correctly call the search servers and
>> they respond.  Problem is I get duplicate results.  Now I have the
>> same data/information from the crawl copied on both machines so the
>> crawl data is replicated on both machines.
>>
>> Questions:
>> 1) how do I prevent the duplicate response? If I start a third search
>> server I only get two duplicate responses so it doesn't seem to
>> increase with the number of search servers
> 
> In your query or in NutchBean, set hitsPerSite=1.  Here is an
> example:
> 
> Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java
> 
> No Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
> 
> This is based on hostname, so for instance java.net and www.java.net
> will be considered different even though they are the same.  The
> latter problem has not been corrected yet in Nutch, but we are working
> on it.
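
To make the hostname-based behaviour concrete, here is a small Python sketch of a per-host limit. It is not Nutch's implementation; the function name `limit_hits_per_site` and the sample URLs are made up for illustration. It only shows why a duplicate from a second search server is dropped while java.net and www.java.net both survive.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site results per exact hostname, in order."""
    seen = defaultdict(int)
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen[host] < hits_per_site:
            seen[host] += 1
            kept.append(url)
    return kept


results = [
    "http://java.net/a",
    "http://java.net/a",      # same page returned by a second search server
    "http://www.java.net/a",  # different hostname string, so it is kept
]
print(limit_hits_per_site(results))
# prints ['http://java.net/a', 'http://www.java.net/a']
```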
> 
>> 2) does tomcat wait for ALL search servers to respond before
>> displaying the query result or does it display the result as soon as
>> one server responds?
> 
> Yes, it waits for all of them, up to a timeout value.  If one goes
> down it will slow down the entire search cluster.
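
The wait-for-all-up-to-a-timeout pattern can be sketched in Python as follows. This is a generic illustration, not Nutch's distributed search client: `search_all`, the two stand-in backends, and the timeout values are all hypothetical. It shows why one dead server makes every query take the full timeout even though the other servers answer quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(backends, query, timeout_s):
    """Query every backend in parallel and return the hits from those
    that answer within timeout_s seconds. Slow or failed ones are skipped."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(backends)))
    futures = [pool.submit(backend, query) for backend in backends]
    done, _not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on stragglers
    hits = []
    for fut in futures:  # iterate in backend order for stable output
        if fut in done and fut.exception() is None:
            hits.extend(fut.result())
    return hits


def fast_backend(query):
    return [f"fast:{query}"]


def slow_backend(query):
    time.sleep(1.0)  # simulates a server that is down or overloaded
    return [f"slow:{query}"]


start = time.monotonic()
print(search_all([fast_backend, slow_backend], "java", timeout_s=0.2))
print(f"elapsed: {time.monotonic() - start:.2f}s")  # close to the timeout
```

The first line printed is `['fast:java']`: the search still works, but the caller waited for the whole timeout before getting it.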
> 
>> 3) in terms of load sharing, what is the best approach for
>> distributed search servers?
> 
> If you are looking at a round-robin sort of load balancing I would say
> two nutch servers hitting different search servers with replicated
> content, fronted by an apache server or hardware load balancer.
> Remember that the entire search can still be up even if one or more
> search servers fail.  I would worry more about clustering the front end
> search website than load balancing the search servers but it all
> depends on what your goal is.  For a www search we don't care if a few
> of the search servers are down as long as the search is functional.
> 
> Dennis Kubes
> 
> 
>> Any help would be greatly appreciated!
>>
>> Thanks,
>>
>> Hilkiah G. Lavinier MEng (Hons), ACGI 