Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Clarifications on distributed searching

------------------------------------------------------------------------------
--------------------------------------------------------------------------------
  Although not really the topic of this tutorial, distributed searching needs to be addressed. In a production system, you would create your indexes and corresponding databases (i.e. crawldb) using the DFS and MapReduce, but you would search them from local filesystems on dedicated search servers, for speed and to avoid network overhead.
- Briefly here is how you would setup distributed searching. Inside of the tomcate WEB-INF/classes directory in the nutch-site.xml file you would point the searcher.dir property to a file that contains a search-servers.txt file. The search servers.txt file would look like this.
+ Briefly, here is how you would set up distributed searching. In the nutch-site.xml file inside the Tomcat WEB-INF/classes directory, you would point the searcher.dir property at a directory that contains a search-servers.txt file. The search-servers.txt file would look like this:
  {{{
  devcluster01 1234

@@ -525, +525 @@

  Each line contains a machine name and port that represents a search server. This tells the website to connect to search servers on those machines at those ports.
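For reference, a minimal sketch of the website-side setting described above: the searcher.dir property named in the tutorial might be set like this in the Tomcat nutch-site.xml (the value /d01/search is only a placeholder for whatever directory actually holds your search-servers.txt file):

{{{
<property>
  <name>searcher.dir</name>
  <value>/d01/search</value>
  <description>Directory containing the search-servers.txt file used
  for distributed searching.</description>
</property>
}}}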
+ On each of the search servers, since we are searching local directories, you need to make sure that the filesystem in the nutch-site.xml file is pointing to local. One of the problems that I came across was that I was using the same Nutch distribution to act as a slave node for DFS and MapReduce as I was using to run the distributed search server. The problem with this was that when the distributed search server started up, it looked in the DFS for the files to read. It couldn't find them, and I would get log messages saying x servers with 0 segments.
+ 
+ I found it easiest to create another Nutch distribution in a separate folder and start the distributed search server from that separate distribution. I just used the default nutch-site.xml and hadoop-site.xml files, which have no configuration. This defaults the filesystem to local, and the distributed search server is able to find the files it needs on the local box.
+ 
+ Whichever way you do it, if your index is on the local filesystem then the configuration needs to point to the local filesystem, as shown below. This is usually set in the hadoop-site.xml file.
+ 
+ {{{
+ <property>
+   <name>fs.default.name</name>
+   <value>local</value>
+   <description>The name of the default file system. Either the
+   literal string "local" or a host:port for DFS.</description>
+ </property>
+ }}}
+ 
  On each of the search servers you would start up the distributed search server using the nutch server command like this:
  {{{
  bin/nutch server 1234 /d01/local/crawled
  }}}
- The arguments are the port to start the server on which must correspond with what you put into the search-servers.txt file and the local directory that is the parent of the index folder. Once the distributed search servers are started on each machine you can startup the website. Searching should then happen normally with the exception of search results being pulled from the distributed search server indexes.
+ The arguments are the port to start the server on, which must correspond with what you put into the search-servers.txt file, and the local directory that is the parent of the index folder. Once the distributed search servers are started on each machine you can start up the website. Searching should then happen normally, except that search results are pulled from the distributed search server indexes. In the logs on the search website (usually the catalina.out file), you should see messages telling you the number of servers and segments the website is attached to and searching. This will let you know whether your setup is correct.
- There is no command to shutdown the distributed search server process, you will simply have to kill it by hand. The tomcat logs for the website should show how many servers and segments it is connected to at any one time. The good news is that the website polls the servers in its search-servers.txt file to constantly check if they are up so you can shut down a single distributed search server, change out its index and bring it back up and the website will reconnect automatically. This was they entire search is never down at any one point in time, only specific parts of the index would be down.
+ There is no command to shut down the distributed search server process; you will simply have to kill it by hand. The good news is that the website polls the servers in its search-servers.txt file to constantly check whether they are up, so you can shut down a single distributed search server, change out its index, and bring it back up, and the website will reconnect automatically. This way the entire search is never down at any one point in time; only specific parts of the index would be down. In a production environment, searching is the biggest cost, both in machines and electricity. The reason is that once an index piece gets beyond about 2 million pages it takes too much time to read from the disk, so you cannot have a 100 million page index on a single machine no matter how big the hard disk is. Fortunately, using distributed searching you can have multiple dedicated search servers, each with their own piece of the index, that are searched in parallel. This allows very large index systems to be searched efficiently.
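As a hypothetical illustration of the parallel setup described above, a search-servers.txt for an index split across three dedicated search servers might look like this (only devcluster01 appears in the tutorial; the other machine names are made up, and each machine would run its own distributed search server on the listed port against its local piece of the index):

{{{
devcluster01 1234
devcluster02 1234
devcluster03 1234
}}}

The website searches all of the listed servers in parallel, and any one of them can be taken down to swap out its index piece while the others keep serving results.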