Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/NutchHadoopTutorial

The comment on the change is:
Fixed searching and added distributed searching instructions

------------------------------------------------------------------------------
    /search
      (nutch installation goes here)
    /filesystem
+   /local (local directory used for searching)
    /home
      (nutch user's home directory)
    /tomcat    (only on one server for searching)
@@ -132, +133 @@

  mkdir /nutch
  mkdir /nutch/search
  mkdir /nutch/filesystem
+ mkdir /nutch/local
  mkdir /nutch/home
  
  groupadd users
@@ -454, +456 @@

  You can also start up new terminals on the slave machines and tail the log 
files to see detailed output for those slave nodes.  The crawl will probably 
take a while to complete.  When it is done we are ready to do the search.
  
  
- == Performing a Search with Hadoop and Nutch ==
+ == Performing a Search ==
  
--------------------------------------------------------------------------------
+ To perform a search on the index we just created within the distributed 
filesystem we need to do two things.  First we need to pull the index to a 
local filesystem, and second we need to set up and configure the nutch war 
file.  Although technically possible, it is not advisable to do searching 
using the distributed filesystem.  
+ 
+ The DFS is great for holding the results of the MapReduce processes, 
including the completed index, but for searching it simply takes too long.  In 
a production system you are going to want to create the indexes using the 
MapReduce system and store the result on the DFS.  Then you are going to want 
to copy those indexes to a local filesystem for searching.  If the indexes are 
too big (e.g. you have a 100 million page index), you are going to want to 
break the index up into multiple pieces (1-2 million pages each), copy the 
index pieces to local filesystems from the DFS, and have multiple search 
servers read from those local index pieces.  A full distributed search setup 
is the topic of another tutorial, but for now realize that you don't want to 
search using DFS, you want to search using local filesystems.  
+ 
+ Once the index has been created on the DFS you can use the hadoop 
copyToLocal command to copy it to the local filesystem like so:
+ 
+ {{{
+ bin/hadoop dfs -copyToLocal crawled /d01/local/
+ }}}
+ 
+ Your crawl directory should have an index directory which should contain 
the actual index files.  Later, when working with Nutch and Hadoop, if you 
have an indexes directory with folders such as part-xxxxx inside of it, you 
can use the nutch merge command to merge those segment indexes into a single 
index.  The search website, when pointed to local, will look for a directory 
in which there is an index folder that contains merged index files, or an 
indexes folder that contains segment indexes.  This can be a tricky part 
because your search website can be working properly, but if it doesn't find 
the indexes, all searches will return nothing.
+ 
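+ As a rough sketch of the merge (the exact arguments can vary between Nutch 
versions, so run bin/nutch merge with no arguments to see the usage for your 
version), merging the part-xxxxx segment indexes into a single index would 
look something like this, assuming the crawl was copied to /d01/local/crawled:
+ 
+ {{{
+ # merge the segment indexes under indexes/ into a single merged index
+ bin/nutch merge /d01/local/crawled/index /d01/local/crawled/indexes
+ }}}
+ 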
- To perform a search on the index we just created within the distributed 
filesystem we first need to setup and configure the nutch war file.  If you 
setup the tomcat server as we stated earlier then you should have a tomcat 
installation under /nutch/tomcat and in the webapps directory you should have a 
folder called ROOT with the nutch war file unzipped inside of it.  Now we just 
need to configure the application to use the distributed filesystem for 
searching.  We do this by editing the hadoop-site.xml file under the 
WEB-INF/classes directory.  Use the following commands:
+ If you set up the tomcat server as we stated earlier then you should have a 
tomcat installation under /nutch/tomcat, and in the webapps directory you 
should have a folder called ROOT with the nutch war file unzipped inside of 
it.  Now we just need to configure the application to use the local filesystem 
for searching.  We do this by editing the nutch-site.xml file under the 
WEB-INF/classes directory.  Use the following commands:
  
  {{{
  cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
- vi hadoop-site.xml
+ vi nutch-site.xml
  }}}
  
- Below is an template hadoop-site.xml file:
+ Below is a template nutch-site.xml file:
  
  {{{
  <?xml version="1.0"?>
@@ -473, +487 @@

  
    <property>
      <name>fs.default.name</name>
-     <value>devcluster01:9000</value>
+     <value>local</value>
    </property>
  
    <property>
      <name>searcher.dir</name>
-     <value>crawled</value>
+     <value>/d01/local/crawled</value>
    </property>
  
  </configuration>
  }}}
  
- The fs.default.name property as before is pointed to our name node.
+ The fs.default.name property is now set to local so that searching uses the 
local index.  Understand that at this point we are not using the DFS or 
MapReduce to do the searching; all of it happens on a local machine.
  
- The searcher.dir directory is the directory that we specified in the 
distributed filesystem under which the index was stored.  In our crawl command 
earlier we used the crawled directory.
+ The searcher.dir property is the directory where the index and the 
supporting databases are stored on the local filesystem.  In our crawl command 
earlier we used the crawled directory, which stored the results in crawled on 
the DFS.  Then we copied the crawled folder to our /d01/local directory on the 
local filesystem.  So here we point this property to /d01/local/crawled.  The 
directory which it points to should contain not just the index directory but 
also the linkdb, segments, etc.  All of these different databases are used by 
the search.  This is why we copied over the crawled directory and not just the 
index directory.
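+ 
+ A quick way to confirm the layout is to list the local crawl directory (the 
folder names below are the ones produced by the crawl command used earlier in 
this tutorial; your crawl may or may not also have an indexes folder):
+ 
+ {{{
+ # the copied crawl directory should contain all of the crawl databases
+ ls /d01/local/crawled
+ # expect folders such as: crawldb  index  linkdb  segments
+ }}}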
  
- Once the hadoop-site.xml file is edited then the application should be ready 
to go.  You can start tomcat with the following command:
+ Once the nutch-site.xml file is edited, the application should be ready to 
go.  You can start tomcat with the following command:
  
  {{{
  cd /nutch/tomcat
  bin/startup.sh
  }}}
  
- Then point you browser to http://devcluster01:8080 (your master node) to see 
the Nutch search web application.  If everything has been configured correctly 
then you should be able to enter queries and get results.
+ Then point your browser to http://devcluster01:8080 (your search server) to 
see the Nutch search web application.  If everything has been configured 
correctly then you should be able to enter queries and get results.  If the 
website is working but you are getting no results, it probably has to do with 
the index directory not being found.  The searcher.dir property must be 
pointed to the parent of the index directory.  That parent must also contain 
the segments, linkdb, and crawldb folders from the crawl.  The index folder 
must be named index and contain merged segment indexes, meaning the index 
files are in the index directory itself and not in a directory below index 
named part-xxxxx for example, or the index directory must be named indexes and 
contain segment indexes of the name part-xxxxx which hold the index files.  I 
have had better luck with merged indexes than with segment indexes.
  
+ == Distributed Searching ==
+ 
--------------------------------------------------------------------------------
+ Although not really the topic of this tutorial, distributed searching needs 
to be addressed.  In a production system, you would create your indexes and 
corresponding databases (e.g. the crawldb) using the DFS and MapReduce, but 
you would search them using local filesystems on dedicated search servers for 
speed and to avoid network overhead.
+ 
+ Briefly, here is how you would set up distributed searching.  Inside of the 
tomcat WEB-INF/classes directory, in the nutch-site.xml file, you would point 
the searcher.dir property to a directory that contains a search-servers.txt 
file.  The search-servers.txt file would look like this.
+ 
+ {{{
+ devcluster01 1234
+ devcluster01 5678
+ devcluster02 9101
+ }}}
+ 
+ Each line contains a machine name and port that represents a search server.  
This tells the website to connect to search servers on those machines at those 
ports.
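+ 
+ For example, if the search-servers.txt file lives in /d01/search (the path 
here is only an illustration), the searcher.dir property in the website's 
nutch-site.xml would point at that directory rather than at a crawl folder:
+ 
+ {{{
+   <property>
+     <name>searcher.dir</name>
+     <!-- directory containing search-servers.txt -->
+     <value>/d01/search</value>
+   </property>
+ }}}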
+ 
+ On each of the search servers you would then start up the distributed 
search server using the nutch server command like this:
+ 
+ {{{
+ bin/nutch server 1234 /d01/local/crawled
+ }}}
+ 
+ The arguments are the port to start the server on, which must correspond to 
what you put into the search-servers.txt file, and the local directory that is 
the parent of the index folder.  Once the distributed search servers are 
started on each machine you can start up the website.  Searching should then 
happen normally, with the exception that search results are pulled from the 
distributed search server indexes.
+ 
+ There is no command to shut down the distributed search server process; you 
will simply have to kill it by hand.  The tomcat logs for the website should 
show how many servers and segments it is connected to at any one time.  The 
good news is that the website polls the servers in its search-servers.txt file 
to constantly check if they are up, so you can shut down a single distributed 
search server, change out its index and bring it back up, and the website will 
reconnect automatically.  This way the entire search is never down at any one 
point in time, only specific parts of the index would be down.
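+ 
+ Killing it by hand would look roughly like this (the grep patterns are just 
examples; match on whatever the java command line for your search server 
process actually shows, such as the port you started it on):
+ 
+ {{{
+ # find the java process for the distributed search server and kill it by pid
+ ps aux | grep nutch | grep 1234
+ kill <pid of the search server process>
+ }}}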
+ 
+ In a production environment searching is the biggest cost, both in machines 
and electricity.  The reason is that once an index piece gets beyond about 2 
million pages it takes too much time to read from the disk, so you can't have 
a 100 million page index on a single machine no matter how big the hard disk 
is.  Fortunately, using distributed searching you can have multiple dedicated 
search servers, each with their own piece of the index, that are searched in 
parallel.  This allows very large index systems to be searched efficiently.
+ 
+ Doing the math, a 100 million page system at roughly 2 million pages per 
index piece would take about 50 dedicated search servers to serve 20+ queries 
per second.  One way to get around having to have so many machines is by using 
multi-processor machines with multiple disks, running multiple search servers 
that each use a separate disk and index.  Going down this route you can cut 
machine costs by as much as 50% and electricity costs by as much as 75%.  A 
multi-disk machine can't handle the same number of queries per second as a 
dedicated single-disk machine, but the number of index pages it can handle is 
significantly greater, so it averages out to be much more efficient.
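+ 
+ As a rough sketch of the multi-disk setup (the ports and the /d0x/local 
paths below are only examples), one machine with four disks could run four 
search servers, one per disk, with the website's search-servers.txt listing 
that host once per port:
+ 
+ {{{
+ # on the search server, start one nutch server per disk/index piece
+ # (run each in its own terminal, or background them as shown here)
+ bin/nutch server 1234 /d01/local/crawled &
+ bin/nutch server 1235 /d02/local/crawled &
+ bin/nutch server 1236 /d03/local/crawled &
+ bin/nutch server 1237 /d04/local/crawled &
+ }}}
+ 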
  
  == Rsyncing Code to Slaves ==
  
--------------------------------------------------------------------------------
@@ -549, +590 @@

  
  If you have any comments or suggestions feel free to email them to me at 
[EMAIL PROTECTED]  If you have questions about Nutch or Hadoop they should be 
addressed to their respective mailing lists.  Below are general resources that 
are helpful with operating and developing Nutch and Hadoop.
  
+ == Updates ==
+ 
--------------------------------------------------------------------------------
+  * I don't use rsync to sync code between the servers any more.  Now I am 
using expect scripts and python scripts to manage and automate the system.
+ 
+  * I use distributed searching with 1-2 million pages per index piece.  We 
now have servers with multiple processors and multiple disks (4 per machine) 
running multiple search servers (1 per disk) to decrease cost and power 
requirements.  With this a single server holding 8 million pages can serve a 
constant 10 queries a second.
+ 
+ 
  == Resources ==
  
--------------------------------------------------------------------------------
  Google MapReduce Paper:
@@ -583, +631 @@

    * When you first start up hadoop, there's a warning in the namenode log, 
"dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove 
e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist" - 
You can ignore that.
    * If you get errors like, "failed to create file [...] on client [foo] 
because target-length is 0, below MIN_REPLICATION (1)" this means a block could 
not be distributed. Most likely there is no datanode running, or the datanode 
has some severe problem (like the lock problem mentioned above).
  
-  * The tutorial says you should point the searcher to the DFS namenode. This 
seems to be pretty inefficient; in a real distributed case you would need to 
set up distributed searchers and avoid network I/O for the DFS. It would be 
nice if this could be addressed in a future version of this tutorial.  
- 

