Tutorial: Nutch 0.8 and Hadoop

This tutorial is derived from the Hadoop + Nutch tutorial and other 0.8 tutorials found on the wiki site and on Google, and it works fine. At the end of the tutorial you will also find a short recrawl tutorial and how to rebuild the index.
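For orientation, here is a condensed sketch of the crawl cycle that the rest of the tutorial walks through step by step, assuming Nutch and Hadoop live in /nutch/search and the crawl database is called crawld (the segment name is a placeholder for whatever bin/nutch generate creates):

# one crawl round, from seed list to index
bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt
bin/nutch inject crawld urls
bin/nutch generate crawld crawld/segments
bin/nutch fetch crawld/segments/<segment-name>
bin/nutch invertlinks crawld/linkdb crawld/segments/<segment-name>
bin/nutch index crawld/indexes crawld crawld/linkdb crawld/segments/<segment-name>

Each command is shown below with its output.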
# Format the Hadoop namenode
[EMAIL PROTECTED]:/nutch/search# bin/hadoop namenode -format
Re-format filesystem in /nutch/filesystem/name ? (Y or N) Y
Formatted /nutch/filesystem/name

# Start Hadoop
[EMAIL PROTECTED]:/nutch/search# bin/start-all.sh
namenode running as process 16789.
[EMAIL PROTECTED]'s password:
jobtracker running as process 16866.
[EMAIL PROTECTED]'s password:
LSearchDev01: starting tasktracker, logging to /nutch/search/logs/hadoop-root-tasktracker-LSearchDev01.out

# List the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls
Found 0 items
# Hadoop works fine

# Use vi to add your sites, one per line, in http://www.yoursite.com format
[EMAIL PROTECTED]:/nutch/search# vi urls.txt

# Make the urls directory on the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -mkdir urls

# Copy the urls.txt file from the Linux file system to the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

# List the file on the Hadoop file system
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/urls <dir>
/user/root/urls/urls.txt <r 2> 41

# If you want to delete the old urls file on Hadoop and put a new one on the file system, use the following commands
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -rm /user/root/urls/urls.txt
Deleted /user/root/urls/urls.txt
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -copyFromLocal urls.txt urls/urls.txt

# Now inject the URLs from urls.txt into the crawld database
[EMAIL PROTECTED]:/nutch/search# bin/nutch inject crawld urls
# (*) To check the status of the jobs, go to: http://127.0.0.1:50030

# This is the new situation of your Hadoop file system now
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld <dir>
/user/root/crawld/current <dir>
/user/root/crawld/current/part-00000 <dir>
/user/root/crawld/current/part-00000/data <r 2> 62
/user/root/crawld/current/part-00000/index <r 2> 33
/user/root/crawld/current/part-00001 <dir>
/user/root/crawld/current/part-00001/data <r 2> 62
/user/root/crawld/current/part-00001/index <r 2> 33
/user/root/crawld/current/part-00002 <dir>
/user/root/crawld/current/part-00002/data <r 2> 124
/user/root/crawld/current/part-00002/index <r 2> 74
/user/root/crawld/current/part-00003 <dir>
/user/root/crawld/current/part-00003/data <r 2> 181
/user/root/crawld/current/part-00003/index <r 2> 74
/user/root/urls <dir>
/user/root/urls/urls.txt <r 2> 64

# Now you can generate the fetch list for the fetch job
[EMAIL PROTECTED]:/nutch/search# bin/nutch generate /user/root/crawld /user/root/crawld/segments
# (*) To check the status of the jobs, go to: http://127.0.0.1:50030

# /user/root/crawld/segments/20060722130642 is the name of the segment that you want to fetch
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls /user/root/crawld/segments
Found 1 items
/user/root/crawld/segments/20060722130642 <dir>
[EMAIL PROTECTED]:/nutch/search#

# Fetch the sites listed in urls.txt
[EMAIL PROTECTED]:/nutch/search# bin/nutch fetch /user/root/crawld/segments/20060722130642
# (*) To check the status of the jobs, go to: http://127.0.0.1:50030
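The tutorial goes straight from the fetch to invertlinks and indexing, which works for a single round. If you plan to run further generate/fetch rounds, the usual Nutch cycle also updates the crawl database with the outcome of the fetch before generating the next segment; a minimal sketch with the paths used above:

# update the crawl db with the fetched segment (optional here, needed before the next generate)
bin/nutch updatedb /user/root/crawld /user/root/crawld/segments/20060722130642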
# This is what is on your Hadoop file system now
[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -lsr
/user/root/crawld <dir>
/user/root/crawld/current <dir>
/user/root/crawld/current/part-00000 <dir>
/user/root/crawld/current/part-00000/data <r 2> 62
/user/root/crawld/current/part-00000/index <r 2> 33
/user/root/crawld/current/part-00001 <dir>
/user/root/crawld/current/part-00001/data <r 2> 62
/user/root/crawld/current/part-00001/index <r 2> 33
/user/root/crawld/current/part-00002 <dir>
/user/root/crawld/current/part-00002/data <r 2> 124
/user/root/crawld/current/part-00002/index <r 2> 74
/user/root/crawld/current/part-00003 <dir>
/user/root/crawld/current/part-00003/data <r 2> 181
/user/root/crawld/current/part-00003/index <r 2> 74
/user/root/crawld/segments <dir>
/user/root/crawld/segments/20060722130642 <dir>
/user/root/crawld/segments/20060722130642/content <dir>
/user/root/crawld/segments/20060722130642/content/part-00000 <dir>
/user/root/crawld/segments/20060722130642/content/part-00000/data <r 2> 62
/user/root/crawld/segments/20060722130642/content/part-00000/index <r 2> 33
/user/root/crawld/segments/20060722130642/content/part-00001 <dir>
/user/root/crawld/segments/20060722130642/content/part-00001/data <r 2> 62
/user/root/crawld/segments/20060722130642/content/part-00001/index <r 2> 33
/user/root/crawld/segments/20060722130642/content/part-00002 <dir>
/user/root/crawld/segments/20060722130642/content/part-00002/data <r 2> 2559
/user/root/crawld/segments/20060722130642/content/part-00002/index <r 2> 74
/user/root/crawld/segments/20060722130642/content/part-00003 <dir>
/user/root/crawld/segments/20060722130642/content/part-00003/data <r 2> 6028
/user/root/crawld/segments/20060722130642/content/part-00003/index <r 2> 74
/user/root/crawld/segments/20060722130642/crawl_fetch <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000 <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/data <r 2> 62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00000/index <r 2> 33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001 <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/data <r 2> 62
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00001/index <r 2> 33
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002 <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/data <r 2> 140
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00002/index <r 2> 74
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003 <dir>
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/data <r 2> 213
/user/root/crawld/segments/20060722130642/crawl_fetch/part-00003/index <r 2> 74
/user/root/crawld/segments/20060722130642/crawl_generate <dir>
/user/root/crawld/segments/20060722130642/crawl_generate/part-00000 <r 2> 119
/user/root/crawld/segments/20060722130642/crawl_generate/part-00001 <r 2> 124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00002 <r 2> 124
/user/root/crawld/segments/20060722130642/crawl_generate/part-00003 <r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse <dir>
/user/root/crawld/segments/20060722130642/crawl_parse/part-00000 <r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00001 <r 2> 62
/user/root/crawld/segments/20060722130642/crawl_parse/part-00002 <r 2> 784
/user/root/crawld/segments/20060722130642/crawl_parse/part-00003 <r 2> 1698
/user/root/crawld/segments/20060722130642/parse_data <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00000/data <r 2> 61
/user/root/crawld/segments/20060722130642/parse_data/part-00000/index <r 2> 33
/user/root/crawld/segments/20060722130642/parse_data/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00001/data <r 2> 61
/user/root/crawld/segments/20060722130642/parse_data/part-00001/index <r 2> 33
/user/root/crawld/segments/20060722130642/parse_data/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00002/data <r 2> 839
/user/root/crawld/segments/20060722130642/parse_data/part-00002/index <r 2> 74
/user/root/crawld/segments/20060722130642/parse_data/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_data/part-00003/data <r 2> 1798
/user/root/crawld/segments/20060722130642/parse_data/part-00003/index <r 2> 74
/user/root/crawld/segments/20060722130642/parse_text <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00000/data <r 2> 61
/user/root/crawld/segments/20060722130642/parse_text/part-00000/index <r 2> 33
/user/root/crawld/segments/20060722130642/parse_text/part-00001 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00001/data <r 2> 61
/user/root/crawld/segments/20060722130642/parse_text/part-00001/index <r 2> 33
/user/root/crawld/segments/20060722130642/parse_text/part-00002 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00002/data <r 2> 377
/user/root/crawld/segments/20060722130642/parse_text/part-00002/index <r 2> 74
/user/root/crawld/segments/20060722130642/parse_text/part-00003 <dir>
/user/root/crawld/segments/20060722130642/parse_text/part-00003/data <r 2> 811
/user/root/crawld/segments/20060722130642/parse_text/part-00003/index <r 2> 74
/user/root/urls <dir>
/user/root/urls/urls.txt <r 2> 64
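If you want a quick summary of what the crawl database now contains (how many URLs are known, fetched, and so on) without reading the raw listing, Nutch also has a readdb tool; a small example using the paths of this tutorial:

# print crawl db statistics
bin/nutch readdb /user/root/crawld -stats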
# Now you need to run the invertlinks job
[EMAIL PROTECTED]:/nutch/search# bin/nutch invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

# And at the end you need to build your index
[EMAIL PROTECTED]:/nutch/search# bin/nutch index /user/root/crawld/indexes /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722130642

[EMAIL PROTECTED]:/nutch/search# bin/hadoop dfs -ls /user/root/crawld
Found 4 items
/user/root/crawld/current <dir>
/user/root/crawld/indexes <dir>
/user/root/crawld/linkdb <dir>
/user/root/crawld/segments <dir>
[EMAIL PROTECTED]:/nutch/search#

At the end of your hard work you have these directories on your Hadoop file system, so you are ready to start Tomcat. Before you start Tomcat, remember to change the path of your search directory in the nutch-site.xml file in the webapps/ROOT/WEB-INF/classes directory.

# This is an example of my configuration
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>LSearchDev01:9000</value>
</property>
<property>
<name>searcher.dir</name>
<value>/user/root/crawld</value>
</property>
</configuration>

# RECRAWL AND NEW INJECT
# Create a new index indexe0
bin/nutch index /user/root/crawld/indexe0 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722153133
# Create a new index indexe1
bin/nutch index /user/root/crawld/indexe1 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/20060722182213
# Dedup the new indexe0
bin/nutch dedup /user/root/crawld/indexe0
# Dedup the new indexe1
bin/nutch dedup /user/root/crawld/indexe1
# Delete the old index
# Merge the new indexes into the merged index directory
bin/nutch merge /user/root/crawld/index /user/root/crawld/indexe0 /user/root/crawld/indexe1 ...
# (and the other indexes created for the fetched segments)
# index is the standard directory in the crawld (DB) where the merged master index lives
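Putting the recrawl steps together, one extra round looks roughly like the sketch below. This is only a sketch with the paths used above: <new-segment> stands for whatever bin/nutch generate creates, indexe2 is just a made-up name for the next per-segment index, and the old merged index is removed before merging again, as the "Delete the old index" step above says.

# generate and fetch a new segment
bin/nutch generate /user/root/crawld /user/root/crawld/segments
bin/nutch fetch /user/root/crawld/segments/<new-segment>
# update the crawl db and the link db with the new segment
bin/nutch updatedb /user/root/crawld /user/root/crawld/segments/<new-segment>
bin/nutch invertlinks /user/root/crawld/linkdb /user/root/crawld/segments/<new-segment>
# index the new segment and dedup it
bin/nutch index /user/root/crawld/indexe2 /user/root/crawld/ /user/root/crawld/linkdb /user/root/crawld/segments/<new-segment>
bin/nutch dedup /user/root/crawld/indexe2
# delete the old merged index, then merge all the per-segment indexes again
bin/hadoop dfs -rmr /user/root/crawld/index
bin/nutch merge /user/root/crawld/index /user/root/crawld/indexe0 /user/root/crawld/indexe1 /user/root/crawld/indexe2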
I hope this helps someone to build their first search engine on Nutch 0.8 + Hadoop :)

Best crawling,
Roberto Navoni
