Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Clarifications on distributed searching

------------------------------------------------------------------------------
  
--------------------------------------------------------------------------------
  Although not really the topic of this tutorial, distributed searching needs 
to be addressed.  In a production system you would create your indexes and 
corresponding databases (e.g. the crawldb) using DFS and MapReduce, but you 
would search them using local filesystems on dedicated search servers, for 
speed and to avoid network overhead.
  
- Briefly here is how you would setup distributed searching.  Inside of the 
tomcate WEB-INF/classes directory in the nutch-site.xml file you would point 
the searcher.dir property to a file that contains a search-servers.txt file.  
The search servers.txt file would look like this.
+ Briefly, here is how you would set up distributed searching.  In the 
nutch-site.xml file inside the tomcat WEB-INF/classes directory, point the 
searcher.dir property to a directory that contains a search-servers.txt file.  
The search-servers.txt file would look like this.
  
  {{{
  devcluster01 1234
@@ -525, +525 @@

  
  Each line contains a machine name and port that represents a search server.  
This tells the website to connect to search servers on those machines at those 
ports.
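 The searcher.dir property itself might be set like this in nutch-site.xml 
(the /d01/local/conf path is only an example; Nutch looks for 
search-servers.txt inside the directory named here):
 
 {{{
 <property>
   <name>searcher.dir</name>
   <value>/d01/local/conf</value>
   <description>Directory containing the search-servers.txt file
   listing the distributed search servers.</description>
 </property>
 }}}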
  
+ On each of the search servers, since we are searching local directories, you 
need to make sure that the filesystem in the nutch-site.xml file points to 
local.  One problem I came across was that I was using the same nutch 
distribution to act as a slave node for DFS and MapReduce as I was using to 
run the distributed search server.  The problem with this was that when the 
distributed search server started up, it looked in the DFS for the files to 
read.  It couldn't find them, and I would get log messages saying x servers 
with 0 segments.  
+ 
+ I found it easiest to create another nutch distribution in a separate folder 
and start the distributed search server from that separate distribution.  I 
just used the default nutch-site.xml and hadoop-site.xml files, which have no 
configuration.  This defaults the filesystem to local, and the distributed 
search server is able to find the files it needs on the local box.  
+ 
+ Whichever way you do it, if your index is on the local filesystem then the 
configuration needs to point to the local filesystem, as shown below.  This is 
usually set in the hadoop-site.xml file.
+ 
+ {{{
+ <property>
+   <name>fs.default.name</name>
+   <value>local</value>
+   <description>The name of the default file system.  Either the
+   literal string "local" or a host:port for DFS.</description>
+ </property>
+ }}}
+ 
  On each of the search servers you would start the distributed search server 
using the nutch server command like this:
  
  {{{
  bin/nutch server 1234 /d01/local/crawled
  }}}
  
- The arguments are the port to start the server on which must correspond with 
what you put into the search-servers.txt file and the local directory that is 
the parent of the index folder. Once the distributed search servers are started 
on each machine you can startup the website.  Searching should then happen 
normally with the exception of search results being pulled from the distributed 
search server indexes.
+ The arguments are the port to start the server on, which must match what 
you put into the search-servers.txt file, and the local directory that is the 
parent of the index folder.  Once the distributed search servers are started 
on each machine, you can start up the website.  Searching should then happen 
normally, except that search results are pulled from the distributed search 
server indexes.  In the logs on the search website (usually the catalina.out 
file), you should see messages telling you the number of servers and segments 
the website is attached to and searching.  This lets you know whether your 
setup is correct.
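 For example, since the local directory handed to the server is the parent of 
the index folder, it would typically contain the standard crawl output 
directories (the path is the same example as above):
 
 {{{
 /d01/local/crawled/
   crawldb/
   linkdb/
   segments/
   indexes/
   index/
 }}}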
  
- There is no command to shutdown the distributed search server process, you 
will simply have to kill it by hand.  The tomcat logs for the website should 
show how many servers and segments it is connected to at any one time.  The 
good news is that the website polls the servers in its search-servers.txt file 
to constantly check if they are up so you can shut down a single distributed 
search server, change out its index and bring it back up and the website will 
reconnect automatically.  This was they entire search is never down at any one 
point in time, only specific parts of the index would be down.
+ There is no command to shut down the distributed search server process; you 
will simply have to kill it by hand.  The good news is that the website 
constantly polls the servers in its search-servers.txt file to check whether 
they are up, so you can shut down a single distributed search server, swap 
out its index, and bring it back up, and the website will reconnect 
automatically.  This way the entire search is never down at any one point in 
time; only specific parts of the index would be down.
  
  In a production environment, searching is the biggest cost in both machines 
and electricity.  The reason is that once an index piece gets beyond about 2 
million pages, it takes too much time to read from the disk, so you cannot 
serve a 100 million page index from a single machine no matter how big the 
hard disk is.  Fortunately, using distributed searching you can have multiple 
dedicated search servers, each with its own piece of the index, searched in 
parallel.  This allows very large index systems to be searched efficiently.
  

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
