Thanks for the explanation.  One further question: if I merge the indexes dir 
(indices) into a single directory called index, should I load the index 
directory or the indexes directory into RAM?  As a follow-up, would 
nutch/tomcat use the index directory over the indexes directory if both 
are present?

Secondly, is there any difference between the nightly builds and the svn 
version?  I was able to build the svn version (using ant) but could NOT build 
the nightly build (#334), which according to Hudson is the last successful 
build.  The build failed with this error:

Buildfile: build.xml

init:

BUILD FAILED
/home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one 
source--a file or resource collection.

Total time: 1 second
[EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31# ant package
Buildfile: build.xml

init:

BUILD FAILED
/home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one 
source--a file or resource collection.

Total time: 1 second
[EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31# 

Regards,
 
Hilkiah G. Lavinier MEng (Hons), ACGI 
6 Winston Lane, 
Goodwill, 
Roseau, Dominica 
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
 
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 20, 2008 9:59:24 AM
Subject: Re: distributed search servers


Here is a link to a previous posting on the hadoop list about how we go
about our setup:

http://www.mail-archive.com/[email protected]/msg10088.html

Long story short, create a tmpfs (which is a RAM file system) and put 
only the indexes part (not contents or linkdb) into memory.  This will 
increase performance 10x if not more.  I don't see much performance 
improvement from putting the nutch site into memory (although I guess you
could), as servlets (jsp) are already in memory.  Currently we are 
testing 5M page indexes on 8G 1U boxes using a PAE kernel.
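
To make the "only the indexes part" step concrete, here is a minimal Python sketch. It assumes a tmpfs is already mounted (for example with `mount -t tmpfs`) at a directory of your choosing, and that the crawl directory has an `indexes` subdirectory; the function name and the paths in the usage note are hypothetical, not taken from the setup described above.

```python
import shutil
from pathlib import Path


def copy_indexes_to_ram(crawl_dir, ram_dir):
    """Copy only the 'indexes' subdirectory of crawl_dir into ram_dir.

    ram_dir is assumed to be a directory on an already-mounted tmpfs.
    Other crawl data (segment contents, linkdb) is deliberately left on disk.
    """
    src = Path(crawl_dir) / "indexes"
    dst = Path(ram_dir) / "indexes"
    if not src.is_dir():
        raise FileNotFoundError(f"no indexes directory at {src}")
    # Drop any stale copy first so the RAM copy mirrors the on-disk index.
    if dst.exists():
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
    return dst
```

A call might look like `copy_indexes_to_ram("/data/crawl", "/mnt/ramindex")`, after which the search server would need to be pointed at the RAM copy. The tmpfs has to be large enough to hold the whole index, and its contents are lost on reboot, so the on-disk copy stays the source of truth.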

Dennis Kubes

Hilkiah Lavinier wrote:
> Thanks for the quick response.
> 
> Dennis, I'm not sure how to change the setting in the NutchBean,
> however I set the variable int hitsPerSite in search.jsp instead.
> 
> On a performance note, do you recommend loading the indexes directory
> in RAM (tmpfs on linux) to reduce IO and increase performance?  I
> guess it depends on how large the index is and how much RAM is available,
> however it sounds like a too-good-to-be-true method of squeezing out
> extra performance from a nutch web server.  Your thoughts pls.
> 
> 
> 
> ----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Saturday, January 19, 2008 7:24:03 PM
> Subject: Re: distributed search servers
> 
> 
> 
> 
> Hilkiah Lavinier wrote:
>> Hi all,
>>
>> Have a distributed search issue I need some advice on.  The scenario
>> is that I have tomcat running off one server and two nutch search
>> servers running off two other machines (so 3 machines in total).
>> I've set up the nutch war to correctly call the search servers and
>> they respond.  Problem is I get duplicate results.  Now I have the same
>> data/information from the crawl copied on both machines so the crawl
>> data is replicated on both machines.
>>
>> Questions:
>> 1) how do I prevent the duplicate response? If I start a third search
>> server I only get two duplicate responses so it doesn't seem to
>> increase with the number of search servers
> 
> In your query or in NutchBean set hitsPerSite=1, here is an
> example:
> 
> Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java
> 
> No Duplicates:
> http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
> 
> This is based on hostname so for instance java.net and www.java.net
> will be considered different even though they are the same.  The latter 
> problem has not been corrected yet in Nutch, but we are working on it.
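
To illustrate what "based on hostname" means in practice, here is a minimal Python sketch of capping results per hostname. It is not Nutch's implementation; the function name `limit_hits_per_site` is hypothetical and only mirrors the idea behind the hitsPerSite parameter.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site results per hostname, preserving order."""
    seen = defaultdict(int)  # hostname -> number of hits kept so far
    kept = []
    for url in urls:
        # The comparison is on the exact hostname string, so
        # "java.net" and "www.java.net" count as two different sites.
        host = urlsplit(url).hostname
        if seen[host] < hits_per_site:
            seen[host] += 1
            kept.append(url)
    return kept
```

With the input `["http://java.net/a", "http://java.net/b", "http://www.java.net/a"]` and `hits_per_site=1`, the second URL is dropped but the third is kept, which is the java.net versus www.java.net behaviour described above.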
> 
>> 2) does tomcat wait for ALL search servers to respond before
>> displaying the query result or does it display the result as soon as
>> one server responds?
> 
> Yes, it waits for all of them, up to a timeout value.  If one goes down
> it will slow down the entire search cluster.
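
The wait-for-all-up-to-a-timeout behaviour can be sketched in Python as follows. This is only an illustration of the pattern, not Nutch's code: `search_all` is a hypothetical name, and each backend is assumed to be a plain callable that takes a query and returns a list of hits.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(backends, query, timeout):
    """Query every backend in parallel and merge the answers.

    Blocks until all backends have answered or `timeout` seconds have
    passed, whichever comes first. Late or failed backends are skipped.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(backends)))
    futures = [pool.submit(backend, query) for backend in backends]
    done, _not_done = wait(futures, timeout=timeout)
    # Do not block on stragglers; their threads finish in the background.
    pool.shutdown(wait=False)
    results = []
    for future in futures:  # iterate in submission order for stable output
        if future in done and future.exception() is None:
            results.extend(future.result())
    return results
```

When every backend is healthy the call returns as soon as the slowest one answers. When one backend hangs, every query pays the full timeout before results come back, which is why a single dead search server slows down the whole cluster.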
> 
>> 3) in terms of load sharing, what is the best approach for
>> distributed search servers?
> 
> If you are looking at a round-robin sort of load balancing I would say 
> two nutch servers hitting different search servers with replicated 
> content fronted by an apache server or hardware load balancer.  Remember 
> that the entire search can still be up even if one or more search 
> servers fail.  I would worry more about clustering the front end search
> website than load balancing the search servers but it all depends on 
> what your goal is.  For a www search we don't care if a few of the 
> search servers are down as long as the search is functional.
> 
> Dennis Kubes
> 
> 
>> Any help would be greatly appreciated!
>>
>> Thanks,
>>