Thanks for the quick response.
Dennis, I'm not sure how to change the setting in the NutchBean, so I set
the int hitsPerSite variable in search.jsp instead.
On a performance note, do you recommend loading the indexes directory into RAM
(tmpfs on Linux) to reduce I/O and increase performance? I guess it depends on
how large the index is and how much RAM is available, but it sounds like a
too-good-to-be-true way of squeezing extra performance out of a Nutch web
server. Your thoughts, please.
Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI
6 Winston Lane,
Goodwill,
Roseau, Dominica
Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201 / AOL hilkiah21
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, January 19, 2008 7:24:03 PM
Subject: Re: distributed search servers
Hilkiah Lavinier wrote:
Hi all,
I have a distributed search issue I need some advice on. The scenario
is that I have Tomcat running on one server and two Nutch search
servers running on two other machines (so 3 machines in total). I've set up
the Nutch war to correctly call the search servers and they respond.
The problem is that I get duplicate results. I have the same
data/information from the crawl copied to both machines, so the crawl data is
replicated on both machines.
Questions:
1) How do I prevent the duplicate responses? If I start a third search
server I still only get two duplicate responses, so the duplication doesn't
seem to increase with the number of search servers.
In your query or in NutchBean, set hitsPerSite=1. Here is an
example:
Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java
No Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1
This is based on hostname, so for instance java.net and www.java.net will
be considered different even though they are the same site. That
problem has not been corrected yet in Nutch, but we are working on it.
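
A minimal sketch of what a per-hostname limit does to a merged result list.
This is not Nutch's code: the class HitsPerSiteSketch and the method
limitPerHost are hypothetical names used only for illustration, and the URLs
are made up. It only shows why duplicates from replicated servers collapse
while java.net and www.java.net survive as separate entries.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Nutch code.
class HitsPerSiteSketch {

    // Returns the URLs in their original order, keeping at most
    // hitsPerSite entries for each exact hostname.
    static List<String> limitPerHost(List<String> urls, int hitsPerSite) {
        Map<String, Integer> seenPerHost = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            // The host string is compared as-is; a leading "www." is not stripped.
            String host = URI.create(url).getHost();
            int count = seenPerHost.merge(host, 1, Integer::sum);
            if (count <= hitsPerSite) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> merged = List.of(
            "http://java.net/a",       // hit from the first search server
            "http://java.net/a",       // same hit from the replicated second server
            "http://www.java.net/a");  // same site under a different hostname
        System.out.println(limitPerHost(merged, 1));
    }
}
```

With a limit of 1, the second java.net entry is dropped, but the
www.java.net entry is kept because its hostname string differs.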
2) Does Tomcat wait for ALL search servers to respond before
displaying the query result, or does it display the result as soon as one server
responds?
Yes, it waits for all of them, up to a timeout value. If one goes down it
will slow down the entire search cluster.
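
The wait-for-all-up-to-a-timeout behaviour can be sketched with the standard
java.util.concurrent API. This is an assumption-laden illustration, not how
Nutch implements its distributed search client: FanOutSketch, queryAll, and
the fake "servers" are hypothetical, and the 500 ms timeout is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not Nutch code.
class FanOutSketch {

    // Sends the query to every server in parallel and merges whatever
    // came back before the timeout; late or failed servers are skipped.
    static List<String> queryAll(List<Callable<List<String>>> servers, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, servers.size()));
        try {
            // Blocks until every task has finished or the timeout expires;
            // tasks still running at that point are cancelled.
            List<Future<List<String>>> futures =
                pool.invokeAll(servers, timeoutMs, TimeUnit.MILLISECONDS);
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                if (future.isCancelled()) {
                    continue; // this server did not answer in time
                }
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // this server failed; carry on with the others
                }
            }
            return merged;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Callable<List<String>> fast = () -> List.of("hit-from-fast-server");
        Callable<List<String>> slow = () -> {
            Thread.sleep(5_000); // simulates a search server that is down or overloaded
            return List.of("hit-from-slow-server");
        };
        long start = System.nanoTime();
        List<String> hits = queryAll(List.of(fast, slow), 500);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(hits + " after about " + elapsedMs + " ms");
    }
}
```

The fast server's hits are returned, but only once the timeout has run out
for the slow one, which is why a single dead server slows every query.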
3) In terms of load sharing, what is the best approach for
distributed search servers?
If you are looking at a round-robin sort of load balancing I would say
two nutch servers hitting different search servers with replicated
content fronted by an apache server or hardware load balancer. Remember
that the entire search can still be up even if one or more search
servers fail. I would worry more about clustering the front end search
website than load balancing the search servers but it all depends on
what your goal is. For a www search we don't care if a few of the
search servers are down as long as the search is functional.
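
For reference, round-robin selection itself is a small piece of logic. The
sketch below is only an illustration of the idea; in practice the Apache
server or hardware load balancer does this. RoundRobinSketch and the
front-end names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of round-robin selection between front ends.
class RoundRobinSketch {
    private final List<String> frontEnds;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<String> frontEnds) {
        this.frontEnds = List.copyOf(frontEnds);
    }

    // Returns the front ends in turn; floorMod keeps the index valid
    // even after the counter overflows to a negative value.
    String pick() {
        int index = Math.floorMod(next.getAndIncrement(), frontEnds.size());
        return frontEnds.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch balancer =
            new RoundRobinSketch(List.of("nutch-web-1", "nutch-web-2"));
        for (int i = 0; i < 4; i++) {
            System.out.println(balancer.pick());
        }
    }
}
```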
Dennis Kubes
Any help would be greatly appreciated!
Thanks,
Hilkiah G. Lavinier MEng (Hons), ACGI