Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchHadoopTutorial" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchHadoopTutorial?action=diff&rev1=37&rev2=38

  
  Something interesting to note about the distributed filesystem is that it is 
user specific.  If you store a directory called urls under the filesystem as the 
nutch user, it is actually stored as /user/nutch/urls.  What this means for us 
is that the user that does the crawl and stores it in the distributed 
filesystem must also be the user that starts the search, or no results will 
come back.  You can try this yourself by logging in as a different user and 
running the ls command as shown.  It won't find the directories because it is 
looking under a different directory, /user/username, instead of /user/nutch.
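  For example, a minimal sketch of that check (assuming the urls directory has 
already been copied into DFS as the nutch user, and that you run the command 
from your Hadoop installation directory):
  
  {{{
  # as the nutch user, the relative path resolves to /user/nutch/urls
  bin/hadoop dfs -ls urls
  
  # as any other user the same command looks under /user/<that-user>/urls,
  # so it will not find the crawl data stored by the nutch user
  bin/hadoop dfs -ls urls
  }}}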
  
- At this stage it might be beneficial to try out a test crawl.
- 
- From your hadoop home directory execute
- 
- {{{
- hadoop jar nutch-${version}.jar org.apache.nutch.crawl.Crawl urls -dir urls 
-depth 1 -topN 5
- }}}
- 
- As before, you can track progress through your logs, or alternatively 
navigate to the aforementioned Hadoop gui's.
-  
  If everything worked then you are good to add other nodes and start the crawl 
;)
  
  
@@ -390, +380 @@

  
  == Performing a Nutch Crawl ==
  
--------------------------------------------------------------------------------
- Now that we have the the distributed file system up and running we can peform 
our nutch crawl.  In this tutorial we are only going to crawl a single site.  I 
am not as concerned with someone being able to learn the crawling aspect of 
nutch as I am with being able to setup the distributed filesystem and mapreduce.
+ Now that we have the distributed file system up and running we can perform 
our fully distributed Nutch crawl.  In this tutorial we are only going to crawl 
the two sites we used above, since we are not as concerned with someone being 
able to learn the crawling aspect of Nutch as with being able to set up the 
distributed filesystem and MapReduce.
  
- To make sure we crawl only a single site we are going to edit crawl urlfilter 
file as set the filter to only pickup lucene.apache.org:
+ To make sure we only crawl the sites we want, we are going to edit the 
regex-urlfilter.txt file and set the filter to only pick up *.apache.org hosts 
(this will permit nutch.apache.org as well):
  
  {{{
  cd /nutch/search
- vi conf/crawl-urlfilter.txt
+ vi conf/regex-urlfilter.txt
  
  change the line that reads:   +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  to read:                      +^http://([a-z0-9]*\.)*apache.org/
  }}}
  
- We have already added our urls to the distributed filesystem and we have 
edited our urlfilter so now it is time to begin the crawl.  To start the nutch 
crawl use the following command:
+ We have already added our urls to the distributed filesystem and we have 
edited our urlfilter, so now it is time to begin the crawl. To start the Nutch 
crawl, first copy your nutch-${version}.job file over to $HADOOP_HOME, then use 
the following command:
  
  {{{
- cd /nutch/search
- bin/nutch crawl urlsdir -dir crawl -depth 3
+ cd $HADOOP_HOME
+ hadoop jar nutch-${version}.job org.apache.nutch.crawl.Crawl urls -dir crawl 
-depth 3 -topN 5
  }}}
  
- We are using the nutch crawl command.  The urlsdir is the urls directory that 
we added to the distributed filesystem.  (I've called it "urlsdir" to make it 
clearer that it isn't merely the *file* containing urls). The "-dir crawl" is 
the output directory.  This will also go to the distributed filesystem.  The 
depth is 3 meaning it will only get 3 page links deep.  There are other options 
you can specify, see the command documentation for those options. 
+ We are running the Nutch Crawl class through Hadoop.  The urls argument is 
the urls directory that we added to the distributed filesystem. The "-dir 
crawl" argument is the output directory, which will also live on the 
distributed filesystem.  A depth of 3 means the crawl will only follow page 
links 3 levels deep.  There are other options you can specify; see the command 
documentation for those options. 
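+ Once the job completes you can sanity-check the output on the distributed 
filesystem (a quick sketch; with the -dir crawl option used above you would 
expect subdirectories such as crawldb, linkdb and segments):
+ 
+ {{{
+ # the crawl output lives under /user/<user>/crawl on the DFS
+ bin/hadoop dfs -ls crawl
+ }}}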
  
  You should see the crawl start up and see output for jobs running and map and 
reduce percentages.  You can keep track of the jobs by pointing your browser to 
the master name node:
  
@@ -426, +416 @@

  You might want to try some of these commands before doing a search
  
  {{{
+ hadoop jar nutch-${version}.job org.apache.nutch.crawl.LinkDbReader 
crawl/linkdb -dump /tmp/linksdir
- bin/nutch readlinkdb crawl/linkdb -dump /tmp/linksdir
- in nutch1.2  linkdb should be chaneged to crawldb :bin/nutch readlinkdb 
crawl/crawldb -dump /tmp/linksdir
  mkdir /nutch/search/output/
  bin/hadoop dfs -copyToLocal /tmp/linksdir  /nutch/search/output/linksdir
  less /nutch/search/output/linksdir/*
@@ -436, +425 @@

  Or if we want to look at the whole crawl database as a text file we might try 
  
  {{{
- bin/nutch readdb crawl/crawldb -dump /tmp/entiredump
+ hadoop jar nutch-${version}.job org.apache.nutch.crawl.CrawlDbReader 
crawl/crawldb -dump /tmp/entiredump
  bin/hadoop dfs -copyToLocal /tmp/entiredump  /nutch/search/output/entiredump
  less /nutch/search/output/entiredump/*
  }}}
  
- == Performing a Search ==
+ == Performing a Search == 
  
--------------------------------------------------------------------------------
+ Quite frankly, this tutorial doesn't aspire to cover the ins and outs of 
using Apache Solr, or any other search architecture. If you ran your crawl with 
the crawl command as above, you can simply specify the Solr URL(s) you wish to 
index into.
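+ For example, something along these lines (a sketch rather than a definitive 
recipe; it assumes a Nutch release whose Crawl command accepts a -solr option 
and a Solr instance listening at the hypothetical URL 
http://localhost:8983/solr):
+ 
+ {{{
+ cd $HADOOP_HOME
+ # same crawl as before, but index the results into Solr as the final step
+ hadoop jar nutch-${version}.job org.apache.nutch.crawl.Crawl urls \
+     -solr http://localhost:8983/solr -dir crawl -depth 3 -topN 5
+ }}}
+ 
+ If your Nutch version ships a separate solrindex job you could equally run 
that step on its own once the crawl has finished.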
- To perform a search on the index we just created within the distributed 
filesystem we need to do two things.  First we need to pull the index to a 
local filesystem and second we need to setup and configure the nutch war file.  
Although technically possible, it is not advisable to do searching using the 
distributed filesystem.  
- 
- The DFS is great for holding the results of the MapReduce processes including 
the completed index, but for searching it simply takes too long.  In a 
production system you are going to want to create the indexes using the 
MapReduce system and store the result on the DFS.  Then you are going to want 
to copy those indexes to a local filesystem for searching.  If the indexes are 
too big (i.e. you have a 100 million page index), you are going to want to 
break the index up into multiple pieces (1-2 million pages each), copy the 
index pieces to local filesystems from the DFS and have multiple search servers 
read from those local index pieces.  A full distributed search setup is the 
topic of another tutorial but for now realize that you don't want to search 
using DFS, you want to search using local filesystems.  
- 
- Once the index has been created on the DFS you can use the hadoop copyToLocal 
command to move it to the local file system as such.
- 
- {{{
- bin/hadoop dfs -copyToLocal crawl /d01/local/
- }}}
- 
- Your crawl directory should have an index directory which should contain the 
actual index files.  Later when working with Nutch and Hadoop if you have an 
indexes directory with folders such as part-xxxxx inside of it you can use the 
nutch merge command to merge segment indexes into a single index.  The search 
website when pointed to local will look for a directory in which there is an 
index folder that contains merged index files or an indexes folder that 
contains segment indexes.  This can be a tricky part because your search 
website can be working properly but if it doesn't find the indexes, all 
searches will return nothing.
- 
- If you setup the tomcat server as we stated earlier then you should have a 
tomcat installation under /nutch/tomcat and in the webapps directory you should 
have a folder called ROOT with the nutch war file unzipped inside of it.  Now 
we just need to configure the application to use the distributed filesystem for 
searching.  We do this by editing the hadoop-site.xml file under the 
WEB-INF/classes directory.  Use the following commands:
- 
- {{{
- cd /nutch/tomcat/webapps/ROOT/WEB-INF/classes
- vi nutch-site.xml
- }}}
- 
- Below is an template nutch-site.xml file:
- 
- {{{
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- 
- <configuration>
- 
-   <property>
-     <name>fs.default.name</name>
-     <value>local</value>
-   </property>
- 
-   <property>
-     <name>searcher.dir</name>
-     <value>/d01/local/crawl</value>
-   </property>
- 
- </configuration>
- }}}
- 
- The fs.default.name property is now pointed locally for searching the local 
index.  Understand that at this point we are not using the DFS or MapReduce to 
do the searching, all of it is on a local machine.
- 
- The searcher.dir directory is the directory where the index and resulting 
database are stored on the local filesystem.  In our crawl command earlier we 
used the crawl directory which stored the results in "crawl" on the HDFS.  Then 
we copied the crawl folder to our /d01/local directory on the local fileystem.  
So here we point this property to /d01/local/crawl.  The directory which it 
points to should contain not just the index directory but also the linkdb, 
segments, etc.  All of these different databases are used by the search.  This 
is why we copied over the entire crawl directory and not just the index 
directory.
- 
- 
- Once the nutch-site.xml file is edited then the application should be ready 
to go.  You can start tomcat with the following command:
- 
- {{{
- cd /nutch/tomcat
- bin/startup.sh
- }}}
- 
- Then point you browser to http://devcluster01:8080 (your search server) to 
see the Nutch search web application.  If everything has been configured 
correctly then you should be able to enter queries and get results.  If the 
website is working but you are getting no results it probably has to do with 
the index directory not being found. The searcher.dir property must be pointed 
to the parent of the index directory.  That parent must also contain the 
segments, linkdb, and crawldb folders from the crawl.  The index folder must be 
named index and contain merged segment indexes, meaning the index files are in 
the index directory and not in a directory below index named part-xxxx for 
example, or the index directory must be named indexes and contain segment 
indexes of the name part-xxxxx which hold the index files.  I have had better 
luck with merged indexes than with segment indexes.
- 
- == Distributed Searching ==
- 
--------------------------------------------------------------------------------
- Although not really the topic of this tutorial, distributed searching needs 
to be addressed.  In a production system, you would create your indexes and 
corresponding databases (i.e. crawldb) using the DFS and MapReduce, but you 
would search them using local filesystems on dedicated search servers for speed 
and to avoid network overhead.
- 
- Briefly here is how you would setup distributed searching.  Inside of the 
tomcat WEB-INF/classes directory in the nutch-site.xml file you would point the 
searcher.dir property to a file that contains a search-servers.txt file.  The 
search servers.txt file would look like this.
- 
- {{{
- devcluster01 1234
- devcluster01 5678
- devcluster02 9101
- }}}
- 
- Each line contains a machine name and port that represents a search server.  
This tells the website to connect to search servers on those machines at those 
ports.
- 
- On each of the search servers, since we are searching local directories, you 
would need to make sure that the filesystem in the nutch-site.xml file is 
pointing to local.  One of the problems that I came across is that I was using 
the same nutch distribution to act as a slave node for DFS and MR as I was 
using to run the distributed search server.  The problem with this was that 
when the distributed search server started up it was looking in the DFS for the 
files to read.  It couldn't find them and I would get log messages saying x 
servers with 0 segments.  
- 
- I found it easiest to create another nutch distribution in a separate folder. 
 I would then start the distributed search server from this separate 
distribution.  I just used the default nutch-site.xml and hadoop-site.xml files 
which have no configuration.  This defaults the filesystem to local and the 
distributed search server is able to find the files it needs on the local box.  
- 
- Whatever way you want to do it, if your index is on the local filesystem then 
the configuration needs to be pointed to use the local filesystem as shown 
below.  This is usually set in the hadoop-site.xml file.
- 
- {{{
- <property>
-  <name>fs.default.name</name>
-   <value>local</value>
-   <description>The name of the default file system.  Either the
-   literal string "local" or a host:port for DFS.</description>
- </property>
- }}}
- 
- On each of the search servers you would use the startup the distributed 
search server by using the nutch server command like this:
- 
- {{{
- bin/nutch server 1234 /d01/local/crawl
- }}}
- 
- The arguments are the port to start the server on which must correspond with 
what you put into the search-servers.txt file and the local directory that is 
the parent of the index folder. Once the distributed search servers are started 
on each machine you can startup the website.  Searching should then happen 
normally with the exception of search results being pulled from the distributed 
search server indexes.  In the logs on the search website (usually catalina.out 
file), you should see messages telling you the number of servers and segments 
the website is attached to and searching.  This will allow you to know if you 
have your setup correct.
- 
- There is no command to shutdown the distributed search server process, you 
will simply have to kill it by hand.  The good news is that the website polls 
the servers in its search-servers.txt file to constantly check if they are up 
so you can shut down a single distributed search server, change out its index 
and bring it back up and the website will reconnect automatically.  This way 
the entire search is never down at any one point in time, only specific parts 
of the index would be down.
- 
- In a production environment searching is the biggest cost, both in machines 
and electricity.  The reason is that once an index piece gets beyond about 2 
million pages it takes too much time to read from the disk, so you can't have a 
100 million page index on a single machine no matter how big the hard disk is.  
Fortunately, using distributed searching you can have multiple dedicated 
search servers, each with its own piece of the index, searched in 
parallel.  This allows very large index systems to be searched efficiently.
- 
- Doing the math, a 100 million page system would take about 50 dedicated 
search servers to serve 20+ queries per second.  One way to get around having 
to have so many machines is by using multi-processor machine with multiple 
disks running multiple search servers each using a separate disk and index.  
Going down this route you can cut machine cost down by as much as 50% and 
electricity costs down by as much as 75%.  A multi-disk machine can't handle 
the same number of queries per second as a dedicated single disk machine but 
the number of index pages it can handle is significantly greater so it averages 
out to be much more efficient.
  
  == Rsyncing Code to Slaves ==
  
--------------------------------------------------------------------------------
@@ -587, +481 @@

  
  == Conclusion ==
  
--------------------------------------------------------------------------------
- I know this has been a lengthy tutorial but hopefully it has gotten you 
familiar with both nutch and hadoop.  Both Nutch and Hadoop are complicated 
applications and setting them up as you have learned is not necessarily an easy 
task.  I hope that this document has helped to make it easier for you.
+ Although this has been a rather lengthy tutorial, hopefully it has gotten you 
familiar with both Nutch and Hadoop. Both are complicated applications and, as 
you have seen, setting them up is not necessarily an easy task. We hope that 
this document has helped to make it easier for you.
  
- If you have any comments or suggestions feel free to email them to me at 
[email protected].  If you have questions about Nutch or Hadoop they 
should be addressed to their respective mailing lists.  Below are general 
resources that are helpful with operating and developing Nutch and Hadoop.
+ If you have any comments or suggestions feel free to tell us on 
[email protected]; details for joining our community lists can be found 
[[http://nutch.apache.org/mailing_lists.html|here]]. If you have questions 
about Nutch or Hadoop they should be addressed to the respective project 
mailing lists.  Below are general resources that are helpful for operating and 
developing Nutch and Hadoop.
  
  == Updates ==
  
--------------------------------------------------------------------------------
@@ -618, +512 @@

  
  http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=144
  
- Hadoop 0.1.2-dev API:
- 
- http://www.netlikon.de/docs/javadoc-hadoop-0.1/overview-summary.html
- 
- ----
- 
-  * - I, StephenHalsey, have used this tutorial and found it very useful, but 
when I tried to add additional datanodes I got error messages in the logs of 
those datanodes saying "2006-07-07 18:58:18,345 INFO 
org.apache.hadoop.dfs.DataNode: Exception: 
org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node 
linux89-1:50010is attempting to report storage ID DS-1437847760. Expecting 
DS-1437847760.".  I think this was because the hadoop/filesystem/data/storage 
file was the same on the new data nodes and they had the same data as the one 
that had been copied from the original.  To get round this I turned everything 
off using bin/stop-all.sh on the name-node and deleted everything in the 
/filesystem directory on the new datanodes so they were clean and ran 
bin/start-all.sh on the namenode and then saw that the filesystem on the new 
datanodes had been created with new hadoop/filesystem/data/storage files and 
new directories and everything seemed to work fine from then on.  This probably 
is not a problem if you do follow the above process without starting any 
datanodes because they will all be empty, but was for me because I put some 
data onto the dfs of the single datanode system before copying it all onto the 
new datanodes.  I am not sure if I made some other error in following this 
process, but I have just added this note in case people who read this document 
experience the same problem.  Well done for the tutorial by the way, very 
helpful. Steve.
- 
- ----
- 
-  * nice tutorial! I tried to set it up without having fresh boxes available, 
just for testing (nutch 0.8). I ran into a few problems. But I finally got it 
to work. Some gotchas:
-   * use absolute paths for the DFS locations. Sounds strange that I used 
this, but I wanted to set up a single hadoop node on my Windows laptop, then 
extend on a Linux box. So relative path names would have come in handy, as they 
would be the same for both machines. Don't try that. Won't work. The DFS showed 
a ".." directory which disappeared when I switched to absolute paths.
-   * I had problems getting DFS to run on Windows at all. I always ended up 
getting this exception: "Could not complete write to file 
e:/dev/nutch-0.8/filesystem/mapreduce/system/submit_2twsuj/.job.jar.crc by 
DFSClient_-1318439814 - seems nutch hasn't been tested much on Windows. So, use 
Linux.
-   * don't use DFS on an NFS mount (this would be pretty stupid anyway, but 
just for testing, one might just set it up into an NFS homre directory). DFS 
uses locks, and NFS may be configured to not allow them.
-   * When you first start up hadoop, there's a warning in the namenode log, 
"dfs.StateChange - DIR* FSDirectory.unprotectedDelete: failed to remove 
e:/dev/nutch-0.8/filesystem/mapreduce/.system.crc because it does not exist" - 
You can ignore that.
-   * If you get errors like, "failed to create file [...] on client [foo] 
because target-length is 0, below MIN_REPLICATION (1)" this means a block could 
not be distributed. Most likely there is no datanode running, or the datanode 
has some severe problem (like the lock problem mentioned above).
- 
- ----
-   
-   * This tutorial worked well for me, however, I ran into a problem where my 
crawl wasn't working.   Turned out, it was because I needed to set the user 
agent and other properties for the crawl.  If anyone is reading this, and 
running into the same problem, look at the updated tutorial 
http://wiki.apache.org/nutch/Nutch0%2e9-Hadoop0%2e10-Tutorial?highlight=%28hadoop%29%7C%28tutorial%29
- 
- ----
- 
-   * By default Nutch will read only the first 100 links on a page.  This will 
result in incomplete indexes when scanning file trees.  So I set the "max 
outlinks per page" option to -1 in nutch-site.conf and got complete indexes.
- {{{
- <property>
-   <name>db.max.outlinks.per.page</name>
-   <value>-1</value>
-   <description>The maximum number of outlinks that we'll process for a page.
-   If this value is nonnegative (>=0), at most db.max.outlinks.per.page 
outlinks
-   will be processed for a page; otherwise, all outlinks will be processed.
-   </description>
- </property>
- }}}
- 
