Hilkiah Lavinier wrote:
Thanks for the explanation.  One further question, if I merge the indexes dir 
(indices) into a directory called index, should I load the index directory or 
the indexes directory into RAM?  A follow-up question: would
nutch/tomcat use the index directory, if it is present, in preference to the
indexes directory?

You would want to load the merged index directory into RAM. I believe that if both index and indexes are in the same directory, index is used. Just to be safe, you may want to move or rename the indexes directory.


Secondly, is there any difference between the nightly builds and the svn 
version?  I was able to build (using ant) the svn version but could NOT build 
the nightly build (#334), which according to Hudson is the last successful 
build.  The failure was due to error :


I don't know what is going on with the nightly build. I always build from SVN.

Dennis Kubes

Buildfile: build.xml

init:

BUILD FAILED
/home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one 
source--a file or resource collection.

Total time: 1 second
[EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31# ant package
Buildfile: build.xml

init:

BUILD FAILED
/home/hilkiah/nutch-2008-01-20_10-49-31/build.xml:61: Specify at least one 
source--a file or resource collection.

Total time: 1 second
[EMAIL PROTECTED]:/home/hilkiah/nutch-2008-01-20_10-49-31#
Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI 6 Winston Lane, Goodwill, Roseau, Dominica Mbl: (767) 275 3382
Hm : (767) 440 3924
Fax: (767) 440 4991
VoIP USA: (646) 432 4487
Email: [EMAIL PROTECTED]
Email: [EMAIL PROTECTED]
IM: Yahoo hilkiah / MSN [EMAIL PROTECTED]
IM: ICQ #8978201  / AOL hilkiah21

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, January 20, 2008 9:59:24 AM
Subject: Re: distributed search servers


Here is a link to a previous posting on the hadoop list about how we go
about our setup:

http://www.mail-archive.com/[email protected]/msg10088.html

Long story short, create a tmpfs (which is a RAM file system) and
put only the indexes part (not contents or linkdb) into memory. This will increase performance 10x if not more. I don't see much performance improvement from putting the nutch site into memory (although I guess you could), as servlets (jsp) are already in memory. Currently we are testing 5M page indexes on 8G 1U boxes using a PAE kernel.
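
As a rough illustration of the "only the indexes part" step, here is a minimal Python sketch. It assumes a tmpfs is already mounted at the target directory (the sketch does not mount anything) and that the crawl directory has an indexes subdirectory. The function name and the paths in the usage line are hypothetical, not part of Nutch.

```python
import shutil
from pathlib import Path


def stage_indexes_in_ram(crawl_dir, ram_dir):
    """Copy only the indexes part of a crawl into a RAM-backed directory.

    Assumption: ram_dir already sits on a mounted tmpfs. Everything else in
    the crawl directory (segment content, linkdb) is left on disk.
    """
    source = Path(crawl_dir) / "indexes"
    target = Path(ram_dir) / "indexes"
    if not source.is_dir():
        raise FileNotFoundError(f"no indexes directory under {crawl_dir}")
    if target.exists():
        shutil.rmtree(target)  # drop a stale copy before refreshing it
    shutil.copytree(source, target)
    return target
```

A call might look like stage_indexes_in_ram("/data/crawl", "/mnt/ramindex"). Since tmpfs contents are lost on reboot, the copy would have to be repeated after every restart and after every re-index. How the search server is pointed at the RAM copy is not covered here.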

Dennis Kubes

Hilkiah Lavinier wrote:
Thanks for the quick response.

Dennis, I'm not sure how to change the setting in the NutchBean, however I
set the variable int hitsPerSite in search.jsp instead.

On a performance note, do you recommend loading the indexes directory in RAM
(tmpfs on Linux) to reduce IO and increase performance?  I guess it depends
on how large the index is and how much RAM is available, however it sounds
like a too-good-to-be-true method of squeezing out extra performance from a
nutch web server.  Your thoughts pls.

Regards,
Hilkiah G. Lavinier MEng (Hons), ACGI 6 Winston Lane, Goodwill, Roseau, Dominica Mbl: (767) 275 3382

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, January 19, 2008 7:24:03 PM
Subject: Re: distributed search servers




Hilkiah Lavinier wrote:
Hi all,

Have a distributed search issue I need some advice on.  The scenario is that
I have tomcat running off one server and two nutch search servers running off
two other machines (so 3 machines in total).  I've set up the nutch war to
correctly call the search servers and they respond.  Problem is I get
duplicate results.  Now I have the same data/information from the crawl
copied on both machines, so the crawl data is replicated on both machines.
Questions:
1) how do I prevent the duplicate response? If I start a third search server
I only get two duplicate responses, so it doesn't seem to increase with the
number of search servers

In your query or in NutchBean, set hitsPerSite=1. Here is an example:

Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java

No Duplicates:
http://search.isc.swlabs.org/search.jsp?lang=en&query=java&hitsPerSite=1

This is based on hostname, so for instance java.net and www.java.net
will be considered different sites even though they are the same. That problem has not been corrected yet in Nutch, but we are working on it.
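
To make the hostname behaviour concrete, here is a small Python sketch of per-host limiting. It is an illustration only, not Nutch's actual code: the function name is made up, and treating a value of 0 or less as "no limit" is an assumption of the sketch.

```python
from urllib.parse import urlsplit


def limit_hits_per_site(urls, hits_per_site=1):
    """Keep at most hits_per_site results per hostname, preserving order.

    Assumption of this sketch: hits_per_site <= 0 means "no limit".
    """
    if hits_per_site <= 0:
        return list(urls)
    seen = {}  # hostname -> number of results already kept
    kept = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        if seen.get(host, 0) < hits_per_site:
            kept.append(url)
            seen[host] = seen.get(host, 0) + 1
    return kept
```

Given the list ["http://java.net/a", "http://java.net/a", "http://www.java.net/a"], the second entry is dropped as a duplicate of the first, but the third survives because www.java.net is a different hostname string from java.net.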
2) does tomcat wait for ALL search servers to respond before displaying the
query result, or does it display the result as soon as one server responds?

Yes, it waits for all of them, up to a timeout value.  If one goes down it
will slow down the entire search cluster.
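
The "wait for everyone, up to a timeout" pattern can be sketched in Python as below. This is not how Nutch implements it. The function name is hypothetical, and each search server is stood in for by a plain callable that takes the query and returns a list of hits.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def query_all(servers, query, timeout_seconds):
    """Send query to every server and wait for all of them, up to a timeout.

    servers maps a name to a callable taking the query and returning a list
    of hits. Returns (merged_hits, names_of_servers_that_were_too_slow).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(servers)))
    futures = {pool.submit(search, query): name
               for name, search in servers.items()}
    done, not_done = wait(futures, timeout=timeout_seconds)
    pool.shutdown(wait=False)  # do not block on stragglers
    hits = []
    for future in futures:  # submission order keeps the merge deterministic
        if future in done and future.exception() is None:
            hits.extend(future.result())
    late = sorted(futures[f] for f in not_done)
    return hits, late
```

In this sketch a single unresponsive server makes every query take the full timeout, which is the slowdown described above. Results from the servers that did answer are still returned.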

3) in terms of load sharing, what is the best approach for distributed
search servers?

If you are looking at a round-robin sort of load balancing, I would say
two nutch servers hitting different search servers with replicated content, fronted by an apache server or hardware load balancer. Remember that the entire search can still be up even if one or more search servers fail. I would worry more about clustering the front end search website than load balancing the search servers, but it all depends on what your goal is. For a www search we don't care if a few of the search servers are down as long as the search is functional.
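
In practice the apache server or hardware load balancer does the rotation for you. Purely to illustrate the round-robin idea, here is a minimal Python sketch. The class and the front end names are hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Hand out front ends in turn, skipping any that are marked down."""

    def __init__(self, frontends):
        self._frontends = list(frontends)
        self._order = cycle(self._frontends)
        self.down = set()  # names of front ends currently out of service

    def next_frontend(self):
        # Try each front end at most once per call.
        for _ in range(len(self._frontends)):
            candidate = next(self._order)
            if candidate not in self.down:
                return candidate
        raise RuntimeError("no front end is available")
```

With two front ends named web1 and web2, successive calls alternate between them. After adding web1 to the down set, every call returns web2 until web1 is removed from the set again.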

Dennis Kubes


Any help would be greatly appreciated!

Thanks,

Hilkiah G. Lavinier MEng (Hons), ACGI 6 Winston Lane, Goodwill, Roseau, Dominica Mbl: (767) 275 3382





  