Corrado, Would it be possible for you to add this to the Wiki?
Also, there are several other tutorials:

http://lucene.apache.org/nutch/tutorial8.html
http://wiki.apache.org/nutch/NutchTutorial
http://wiki.apache.org/nutch/NutchHadoopTutorial

Maybe you can combine them?

Otis
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: zzcgiacomini <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:53:54 AM
Subject: [Nutch-general] Nutch Step by Step

Maybe someone will find this useful. I have spent some time playing with nutch-0.8 and collecting notes from the mailing lists; perhaps someone will find these notes useful and could point out my mistakes. I am not at all a nutch expert...

-Corrado

0) CREATE NUTCH USER AND GROUP

Create a nutch user and group and perform all of the following steps logged in as the nutch user.
Put these lines in your .bash_profile:

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Download the nutch and hadoop trunks as explained on http://lucene.apache.org/hadoop/version_control.html:

svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk

2) BUILD HADOOP

Build and produce the tar file:

cd hadoop/trunk
ant tar

To build hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest lzo library (http://www.oberhumer.com/opensource/lzo/download/).
Note: the packages currently available for FC5 are too old.

tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install

B) Compile the 64-bit native libs for hadoop, if needed:

cd hadoop/trunk/src/native
export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64
CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

In config.h replace the line

#define HADOOP_LZO_LIBRARY libnotfound.so

with this one:

#define HADOOP_LZO_LIBRARY "liblzo2.so"

then run:

make

3) BUILD NUTCH

The nightly nutch trunk now comes with hadoop-0.12.jar, but you may want to put in the latest nightly-build hadoop jar:

mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will participate in the engine activities (see the sketch below). In my case I only have two identical machines available, called myhost1 and myhost2. On each of them I have installed the nutch binaries under /opt/nutch, while I have decided to keep the hadoop distributed filesystem in a directory called hadoopFs located on a large disk mounted on /disk10.

On both machines create the directory:

mkdir /disk10/hadoopFs

Copy the hadoop 64-bit native libraries if needed:

mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64
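Here is a minimal sketch of the copy-and-untar step, run from the build machine as the nutch user. The tarball name nutch-0.8-dev.tar.gz and its location under build/ are assumptions (check what "ant tar" actually produced), and it relies on the ssh setup described in step 6:

for h in myhost1 myhost2; do
    scp build/nutch-0.8-dev.tar.gz $h:/tmp/
    ssh $h 'mkdir -p /opt/nutch && tar xzf /tmp/nutch-0.8-dev.tar.gz -C /opt/nutch --strip-components=1'
done

The --strip-components=1 option (GNU tar) assumes the archive has a single top-level directory; drop it if the archive already unpacks flat.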
5) CONFIG

I will use myhost1 as the master machine running the namenode and jobtracker daemons; it will also run a datanode and a tasktracker. myhost2 will only run a datanode and a tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file. Here are the values I have used (a concrete sketch of the file follows below):

fs.default.name : myhost1.mydomain.org:9010
mapred.job.tracker : myhost1.mydomain.org:9011
mapred.map.tasks : 40
mapred.reduce.tasks : 3
dfs.name.dir : /opt/hadoopFs/name
dfs.data.dir : /opt/hadoopFs/data
mapred.system.dir : /opt/hadoopFs/mapreduce/system
mapred.local.dir : /opt/hadoopFs/mapreduce/local
dfs.replication : 2

"The mapred.map.tasks property tells how many tasks you want to run in parallel. This should be a multiple of the number of computers that you have. In our case, since we are starting out with 2 computers, we will have 4 map and 4 reduce tasks."

"The dfs.replication property states how many servers a single file should be replicated to before it becomes available. Because we are using 2 servers I have set this at 2."

You may also want to change nutch-site.xml by adding http.redirect.max with a value different from the default of 3:

http.redirect.max : 10
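As a concrete sketch, this is what conf/hadoop-site.xml might look like with the values above. Only the first two properties are written out; the remaining ones follow exactly the same <property> pattern. The here-document is just for illustration (in practice you would edit the existing file), and the hostnames and paths are the ones from my setup:

cat > conf/hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>myhost1.mydomain.org:9010</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>myhost1.mydomain.org:9011</value>
  </property>
  <!-- mapred.map.tasks, mapred.reduce.tasks, dfs.name.dir, dfs.data.dir,
       mapred.system.dir, mapred.local.dir and dfs.replication are added
       the same way, one <property> block each -->
</configuration>
EOF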
B) Be sure that your conf/slaves file contains the names of the slave machines. In my case:

myhost1.mydomain.org
myhost2.mydomain.org

C) Create directories for pid and log files on both machines:

mkdir /opt/nutch/pids
mkdir /opt/nutch/logs

D) On both machines change the conf/hadoop-env.sh file to point to the right java and nutch installations:

export HADOOP_HOME=/opt/nutch
export JAVA_HOME=/opt/jdk
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_PID_DIR=${HADOOP_HOME}/pids

E) Because of a classloader problem in nutch, the following lines need to be added to the nutch/bin/hadoop script before it starts building the CLASSPATH variable:

for f in $HADOOP_HOME/nutch-*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

This will put the nutch-*.jar file into the CLASSPATH.

6) SSH SETUP (Important!!)

Set up ssh as explained in http://wiki.apache.org/nutch/NutchHadoopTutorial and test the ability to log in without a password from each machine to itself and from myhost1 to myhost2 and vice versa. This is a very important step to avoid "connection refused" problems between daemons. Here is a short example of how to proceed:

A) Use ssh-keygen to create the .ssh/id_dsa files:

ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nutch/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nutch/.ssh/id_dsa.
Your public key has been saved in /home/nutch/.ssh/id_dsa.pub.
The key fingerprint is:
01:36:6c:9d:27:09:54:e4:ff:fb:20:86:8c:e1:6c:82 [EMAIL PROTECTED]

B) Copy .ssh/id_dsa.pub to all machines as .ssh/authorized_keys (a sketch follows after step D).

C) On each machine configure ssh-agent to start at login, either by adding a line in .xsession, e.g.

ssh-agent startkde

or by adding

eval `ssh-agent`

to .bashrc (this will start an ssh-agent for every new shell).

D) Use ssh-add to add the dsa key:

ssh-add
Enter passphrase for /home/nutch/.ssh/id_dsa:
Identity added: /home/nutch/.ssh/id_dsa (/home/nutch/.ssh/id_dsa)
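A minimal sketch of step B, assuming the key pair was generated on myhost1 and that the nutch home directory is /home/nutch on both machines (adjust hostnames and paths; note that this overwrites any existing authorized_keys file):

for h in myhost1 myhost2; do
    ssh $h 'mkdir -p ~/.ssh && chmod 700 ~/.ssh'
    scp ~/.ssh/id_dsa.pub $h:/home/nutch/.ssh/authorized_keys
    ssh $h 'chmod 600 ~/.ssh/authorized_keys'
done

ssh myhost2 hostname

Once ssh-agent holds the key (step D), the last command should print the remote hostname without asking for a password.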
7) FORMAT HADOOP FILESYSTEM

"Fix for HADOOP-19. A namenode must now be formatted before it may be used. Attempts to start a namenode in an unformatted directory will fail, rather than automatically creating a new, empty filesystem, causing existing datanodes to delete all blocks. Thus a mis-configured dfs.data.dir should no longer cause data loss."

On the master machine (myhost1) run these commands:

cd /opt/nutch/
bin/hadoop namenode -format

This will create the /opt/hadoopFs/name/image directory.

8) START NAMENODE

Start the namenode on the master machine (myhost1):

bin/hadoop-daemon.sh start namenode
starting namenode, logging to /opt/nutch/logs/hadoop-nutch-namenode-myhost1.mydomain.org.out
060509 150431 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150431 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150431 directing logs to directory /opt/nutch/logs

9) START DATANODES

Start a datanode on the master and on all slave machines (myhost1 and myhost2).

on myhost1:

bin/hadoop-daemon.sh start datanode
starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost1.mydomain.org.out
060509 150619 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150619 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150619 0x0000000a directing logs to directory /opt/nutch/logs

on myhost2:

bin/hadoop-daemon.sh start datanode
starting datanode, logging to /opt/nutch/logs/hadoop-nutch-datanode-myhost2.mydomain.org.out
060509 151517 0x0000000a parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 151517 0x0000000a parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 151517 0x0000000a directing logs to directory /opt/nutch/logs

10) START JOBTRACKER

Start the jobtracker on the master machine (myhost1).

on myhost1:

bin/hadoop-daemon.sh start jobtracker
starting jobtracker, logging to /opt/nutch/logs/hadoop-nutch-jobtracker-myhost1.mydomain.org.out
060509 152020 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152021 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152021 directing logs to directory /opt/nutch/logs

11) START TASKTRACKERS

Start a tasktracker on the slave machines (myhost1 and myhost2).

on myhost1:

bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost1.mydomain.org.out
060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152236 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152236 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152236 directing logs to directory /opt/nutch/logs

on myhost2:

bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to /opt/nutch/logs/hadoop-nutch-tasktracker-myhost2.mydomain.org.out
060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152333 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152333 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152333 directing logs to directory /opt/nutch/logs

NOTE: Now that we have verified that the daemons start and connect properly, we can start and stop all of them using the start-all.sh and stop-all.sh scripts from the master machine.
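For example (a sketch; jps ships with the JDK, and the exact daemon names it prints may vary between hadoop versions):

bin/stop-all.sh
bin/start-all.sh
/opt/jdk/bin/jps

stop-all.sh and start-all.sh stop and start the namenode, datanodes, jobtracker and tasktrackers on all machines listed in conf/slaves; running jps on each node is a quick sanity check that the expected java daemons are actually up.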
12) TEST FUNCTIONALITY

Test hadoop functionality with just a simple ls:

bin/hadoop dfs -ls
060509 152844 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060509 152845 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060509 152845 No FS indicated, using default:localhost:9010
060509 152845 Client connection to 127.0.0.1:9010: starting
Found 0 items

The dfs filesystem is empty... of course.

13) CREATE THE FILE OF URLs TO INJECT

Now we need to create a crawldb and inject URLs into it. These initial URLs will then be used for the initial crawl. Let's inject URLs from the DMOZ Open Directory. First we must download and uncompress the file listing all of the DMOZ pages. (This is about a 300MB compressed file, roughly 2GB uncompressed, so this will take a few minutes.)

On the myhost1 machine, where we run the namenode:

cd /disk10
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz

A) 5 million pages

DMOZ contains around 5 million URLs; we can parse them all:

/opt/nutch-0.8-dev/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -2131431075
060510 104615 Begin parse
060510 104616 Client connection to myhost1:9010: starting
060510 105156 Completed parse. Found 4756391 pages.

B) As a second choice we can select a random subset of these pages instead. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) We select one out of every 100, so that we end up with around 50,000 URLs:

bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 100 > dmoz/urls
060510 104615 parsing jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -736060357
060510 104615 Begin parse
060510 104615 Client connection to myhost1:9010: starting
060510 104615 Completed parse. Found 49498 pages.

Here I go for choice B. The parser also takes a few minutes, as it must parse the full 2GB file.
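As an optional sanity check before copying the url list into dfs, you can look at what DmozParser produced (it should be a plain text file with one url per line):

wc -l /disk10/dmoz/urls
head -5 /disk10/dmoz/urls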
Finally, we copy the selected urls into dfs, so that we can initialize the crawl db with them:

bin/hadoop dfs -put /disk10/dmoz dmoz
060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 101321 No FS indicated, using default:myhost1.mydomain.org:9010
060510 101321 Client connection to 10.234.57.38:9010: starting
060510 101321 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml

bin/hadoop dfs -lsr dmoz
060510 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 134738 No FS indicated, using default:myhost1.mydomain.org:9010
060510 134738 Client connection to 10.234.57.38:9010: starting
/user/nutch/dmoz <dir>
/user/nutch/dmoz/urls <r 2> 57059180

14) CREATE CRAWLDB (INJECT URLs)

Create a crawldb and inject the urls into the web database:

bin/nutch inject test/crawldb dmoz
060511 092330 Injector: starting
060511 092330 Injector: crawlDb: test/crawldb
060511 092330 Injector: urlDir: dmoz
060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Injector: Converting injected urls to crawl db entries.
060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Client connection to 10.234.57.38:9010: starting
060511 092330 Client connection to 10.234.57.38:9011: starting
060511 092330 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092332 Running job: job_0001
060511 092333 map 0% reduce 0%
060511 092342 map 25% reduce 0%
060511 092344 map 50% reduce 0%
060511 092354 map 75% reduce 0%
060511 092402 map 100% reduce 0%
060511 092412 map 100% reduce 25%
060511 092414 map 100% reduce 75%
060511 092422 map 100% reduce 100%
060511 092423 Job complete: job_0001
060511 092423 Injector: Merging injected urls into crawl db.
060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092424 Running job: job_0002
060511 092425 map 0% reduce 0%
060511 092442 map 25% reduce 0%
060511 092444 map 50% reduce 0%
060511 092454 map 75% reduce 0%
060511 092502 map 100% reduce 0%
060511 092511 map 100% reduce 25%
060511 092513 map 100% reduce 75%
060511 092522 map 100% reduce 100%
060511 092523 Job complete: job_0002
060511 092523 Injector: done

This will create the test/crawldb folder in the dfs. From the nutch tutorial: "The crawl database, or crawldb. This contains information about every url known to Nutch, including whether it was fetched, and, if so, when."

You can also see that the physical filesystem where we put the dfs has changed: a few data block files have been created. This happens on both the myhost1 and myhost2 machines, which participate in the dfs tree /disk10/hadoopFs:

/disk10/hadoopFs
|-- data
|   |-- data
|   |   |-- blk_-1388015236827939264
|   |   |-- blk_-2961663541591843930
|   |   |-- blk_-3901036791232325566
|   |   |-- blk_-5212946459038293740
|   |   |-- blk_-5301517582607663382
|   |   |-- blk_-7397383874477738842
|   |   |-- blk_-9055045635688102499
|   |   |-- blk_-9056717903919576858
|   |   |-- blk_1330666339588899715
|   |   |-- blk_1868647544763144796
|   |   |-- blk_3136516483028291673
|   |   |-- blk_4297959992285923734
|   |   |-- blk_5111098874834542511
|   |   |-- blk_5224195282207865093
|   |   |-- blk_5554003155307698150
|   |   |-- blk_7122181909600991812
|   |   |-- blk_8745902888438265091
|   |   `-- blk_883778723937265061
|   `-- tmp
|-- mapreduce
`-- name
    |-- edits
    `-- image
        `-- fsimage

To inspect the injected crawldb you can dump it to a flat file in dfs and copy it out to the local filesystem:

bin/nutch readdb test/crawldb -dump tmp/crawldbDump1
bin/hadoop dfs -lsr
bin/hadoop dfs -get tmp/crawldbDump1 tmp/
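The dump is plain text (one record per url), so once it has been copied out of dfs it can be inspected with ordinary tools; for example (the part file name is an assumption, check what the dump directory actually contains):

head -20 tmp/crawldbDump1/part-00000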
15) CREATE FETCHLIST

To fetch, we first need to generate a fetchlist from the injected URLs in the database. This generates a fetchlist for all of the pages due to be fetched. The fetchlist is placed in a newly created segment directory, which is named by the time it was created.

bin/nutch generate test/crawldb test/segments
060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Generator: starting
060511 101525 Generator: segment: test/segments/20060511101525
060511 101525 Generator: Selecting most-linked urls due for fetch.
060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Client connection to 10.234.57.38:9010: starting
060511 101525 Client connection to 10.234.57.38:9011: starting
060511 101525 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101527 Running job: job_0001
060511 101528 map 0% reduce 0%
060511 101546 map 50% reduce 0%
060511 101556 map 75% reduce 0%
060511 101606 map 100% reduce 0%
060511 101616 map 100% reduce 75%
060511 101626 map 100% reduce 100%
060511 101627 Job complete: job_0001
060511 101627 Generator: Partitioning selected urls by host, for politeness.
060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101628 Running job: job_0002
060511 101629 map 0% reduce 0%
060511 101646 map 40% reduce 0%
060511 101656 map 60% reduce 0%
060511 101706 map 80% reduce 0%
060511 101717 map 100% reduce 0%
060511 101726 map 100% reduce 100%
060511 101727 Job complete: job_0002
060511 101727 Generator: done

At the end of this we will have the new fetchlist created in the segment directory:

test/segments/20060511101525/crawl_generate/part-00000 <r 2> 777933
test/segments/20060511101525/crawl_generate/part-00001 <r 2> 751088
test/segments/20060511101525/crawl_generate/part-00002 <r 2> 988871
test/segments/20060511101525/crawl_generate/part-00003 <r 2> 833454

The generated fetchlist of a segment can be dumped with readseg, for example:

bin/nutch readseg -dump test/segments/20061027135841 test/segments/20061027135841/gendump -nocontent -nofetch -noparse -noparsedata -noparsetext

16) FETCH

Now we run the fetcher on the created segment. This will load the web pages into the segment.

bin/nutch fetch test/segments/20060511101525
060511 101820 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101820 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Fetcher: starting
060511 101821 Fetcher: segment: test/segments/20060511101525
060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Client connection to 10.234.57.38:9011: starting
060511 101821 Client connection to 10.234.57.38:9010: starting
060511 101821 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101822 Running job: job_0003
060511 101823 map 0% reduce 0%
060511 110818 map 25% reduce 0%
060511 112428 map 50% reduce 0%
060511 122241 map 75% reduce 0%
060511 133613 map 100% reduce 0%
060511 133823 map 100% reduce 100%
060511 133824 Job complete: job_0003
060511 133824 Fetcher: done
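The fetch is by far the longest step (several hours here, judging from the job timestamps). While it runs, progress can be followed on the console output above or in the tasktracker logs on each node, for example (the exact log file names under /opt/nutch/logs may differ):

tail -f /opt/nutch/logs/hadoop-nutch-tasktracker-*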
17) UPDATE CRAWLDB

When the fetcher is complete, we update the database with the results of the fetch. This will add to the database entries for all of the pages referenced by the initial set from the dmoz file.

bin/nutch updatedb test/crawldb test/segments/20060511101525
060511 134940 CrawlDb update: starting
060511 134940 CrawlDb update: db: test/crawldb
060511 134940 CrawlDb update: segment: test/segments/20060511101525
060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134940 Client connection to 10.234.57.38:9010: starting
060511 134940 CrawlDb update: Merging segment data into db.
060511 134940 Client connection to 10.234.57.38:9011: starting
060511 134940 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134941 Running job: job_0004
060511 134942 map 0% reduce 0%
060511 134954 map 17% reduce 0%
060511 135004 map 25% reduce 0%
060511 135013 map 33% reduce 0%
060511 135023 map 42% reduce 0%
060511 135024 map 50% reduce 0%
060511 135034 map 58% reduce 0%
060511 135044 map 67% reduce 0%
060511 135054 map 83% reduce 0%
060511 135104 map 92% reduce 0%
060511 135114 map 100% reduce 0%
060511 135124 map 100% reduce 100%
060511 135125 Job complete: job_0004
060511 135125 CrawlDb update: done

A) We can now see the crawl statistics:

bin/nutch readdb test/crawldb -stats
060511 135340 CrawlDb statistics start: test/crawldb
060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135340 Client connection to 10.234.57.38:9010: starting
060511 135340 Client connection to 10.234.57.38:9011: starting
060511 135340 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135341 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135341 Running job: job_0005
060511 135342 map 0% reduce 0%
060511 135353 map 25% reduce 0%
060511 135354 map 50% reduce 0%
060511 135405 map 75% reduce 0%
060511 135414 map 100% reduce 0%
060511 135424 map 100% reduce 25%
060511 135425 map 100% reduce 50%
060511 135434 map 100% reduce 75%
060511 135444 map 100% reduce 100%
060511 135445 Job complete: job_0005
060511 135445 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135445 Statistics for CrawlDb: test/crawldb
060511 135445 TOTAL urls: 585055
060511 135445 avg score: 1.068
060511 135445 max score: 185.981
060511 135445 min score: 1.0
060511 135445 retry 0: 583943
060511 135445 retry 1: 1112
060511 135445 status 1 (DB_unfetched): 540202
060511 135445 status 2 (DB_fetched): 43086
060511 135445 status 3 (DB_gone): 1767
060511 135445 CrawlDb statistics: done

"I believe the retry numbers are the number of times page fetches failed for recoverable errors and were re-processed before the page was fetched. So most of the pages were fetched on the first try. Some encountered errors and were fetched on the next try, and so on. The default setting is a maximum of 3 retries, in the db.fetch.retry.max property."

B) We can now dump the crawl db to a flat file in dfs and get a copy out to a local file:

bin/nutch readdb test/crawldb -dump mydump
060511 135603 CrawlDb dump: starting
060511 135603 CrawlDb db: test/crawldb
060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135603 Client connection to 10.234.57.38:9010: starting
060511 135603 Client connection to 10.234.57.38:9011: starting
060511 135603 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135604 Running job: job_0006
060511 135605 map 0% reduce 0%
060511 135624 map 50% reduce 0%
060511 135634 map 75% reduce 0%
060511 135644 map 100% reduce 0%
060511 135654 map 100% reduce 25%
060511 135704 map 100% reduce 100%
060511 135705 Job complete: job_0006
060511 135705 CrawlDb dump: done

bin/hadoop dfs -lsr mydump
060511 135802 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135802 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135803 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135803 Client connection to 10.234.57.38:9010: starting
/user/nutch/mydump/part-00000 <r 2> 39031197
/user/nutch/mydump/part-00001 <r 2> 39186940
/user/nutch/mydump/part-00002 <r 2> 38954809
/user/nutch/mydump/part-00003 <r 2> 39171283

bin/hadoop dfs -get mydump/part-00000 mydumpFile
060511 135848 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135848 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135848 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135848 Client connection to 10.234.57.38:9010: starting

more mydumpFile
gopher://csf.Colorado.EDU/11/ipe/Thematic_Archive/newsletters/africa_information_afrique_net/Angola  Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:38:09 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0666667
Signature: null
Metadata: null

gopher://gopher.gwdg.de/11/Uni/igdl  Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:37:03 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0140845
Signature: null
Metadata: null

gopher://gopher.jer1.co.il:70/00/jorgs/npo/camera/media/1994/npr  Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:36:48 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0105263
Signature: null
Metadata: null
...
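Since the dump is plain text, quick ad-hoc checks can be done on the local copy with standard tools; for example, counting unfetched vs. fetched entries in this part file:

grep -c "DB_unfetched" mydumpFile
grep -c "DB_fetched" mydumpFile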
18) INVERT LINKS

Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages. We now need to generate a linkdb; this is done over all of the segments in your segments folder.

bin/nutch invertlinks linkdb test/segments/20060511101525
060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: starting
060511 140228 LinkDb: linkdb: linkdb
060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 Client connection to 10.234.57.38:9010: starting
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: adding segment: test/segments/20060511101525
060511 140228 Client connection to 10.234.57.38:9011: starting
060511 140228 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140229 Running job: job_0007
060511 140230 map 0% reduce 0%
060511 140255 map 50% reduce 0%
060511 140305 map 75% reduce 0%
060511 140314 map 100% reduce 0%
060511 140324 map 100% reduce 100%
060511 140325 Job complete: job_0007
060511 140325 LinkDb: done
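To check what invertlinks produced, the linkdb can be dumped to a flat file much like the crawldb. The readlinkdb command and its -dump option are assumptions to verify against your build (running bin/nutch with no arguments lists the available commands):

bin/nutch readlinkdb linkdb -dump tmp/linkdbDump
bin/hadoop dfs -get tmp/linkdbDump tmp/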
23) INDEX SEGMENT

To index the segment we use the index command, as follows:

bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525
060515 134738 Indexer: starting
060515 134738 Indexer: linkdb: linkdb
060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134738 Indexer: adding segment: test/segments/20060511101525
060515 134738 Client connection to 10.234.57.38:9010: starting
060515 134738 Client connection to 10.234.57.38:9011: starting
060515 134739 parsing jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134739 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134739 Running job: job_0006
060515 134741 map 0% reduce 0%
060515 134758 map 11% reduce 0%
060515 134808 map 18% reduce 0%
060515 134818 map 25% reduce 0%
060515 134827 map 38% reduce 2%
060515 134837 map 44% reduce 2%
060515 134847 map 50% reduce 9%
060515 134857 map 53% reduce 11%
060515 134908 map 59% reduce 13%
060515 134918 map 66% reduce 13%
060515 134928 map 71% reduce 13%
060515 134938 map 74% reduce 13%
060515 134948 map 88% reduce 16%
060515 134957 map 94% reduce 17%
060515 135007 map 100% reduce 22%
060515 135017 map 100% reduce 50%
060515 135028 map 100% reduce 78%
060515 135038 map 100% reduce 82%
060515 135048 map 100% reduce 87%
060515 135058 map 100% reduce 92%
060515 135108 map 100% reduce 97%
060515 135117 map 100% reduce 99%
060515 135118 map 100% reduce 100%
060515 135129 Job complete: job_0006
060515 135129 Indexer: done
24) Try searching the engine using nutch itself

Nutch looks for the index and segments subdirectories of dfs in the directory defined by the searcher.dir property. Edit conf/nutch-site.xml and add the following lines:

<property>
  <name>searcher.dir</name>
  <value>test</value>
  <description>
  Path to root of crawl. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes.
  </description>
</property>

This is where the search looks for its data, as explained in the description. Now run a search using nutch itself, for example:

/opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean developpement

26) Search the engine using the browser

To search you need to have tomcat installed and to put the nutch war file into the tomcat servlet container. I have built and installed tomcat as /opt/tomcat.

Note (important): something interesting about the distributed filesystem is that it is user specific. If you store a directory urls in the filesystem as the nutch user, it is actually stored as /user/nutch/urls. What this means for us is that the user that does the crawl and stores it in the distributed filesystem must also be the user that starts the search, or no results will come back. You can try this yourself by logging in as a different user and running the ls command: it won't find the directories, because it is looking under a different directory, /user/username instead of /user/nutch.

As explained above, we need to run tomcat as the nutch user in order to get search results. Be sure to have write permission on the nutch logs directory and read permission on the rest of the nutch installation:

login as root
chmod -R ugo+rx /opt/nutch
chmod -R ugo+rwx /opt/nutch/logs
export CATALINA_OPTS="-server -Xss256k -Xms768m -Xmx768m -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true"
rm -rf /opt/tomcat/webapps/ROOT*
cp /opt/nutch/nutch*.war /opt/tomcat/webapps/ROOT.war
/opt/tomcat/bin/startup.sh

This should create a new webapps/ROOT root directory.

We now have to ensure that the webapp can find the index and segments. The tomcat webapp uses the nutch configuration files under /opt/tomcat/webapps/ROOT/WEB-INF/classes, so copy your modified configuration files there from the nutch conf directory:

cp /opt/nutch/conf/hadoop-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
cp /opt/nutch/conf/hadoop-env.sh /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-env.sh
cp /opt/nutch/conf/nutch-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml

Now restart tomcat and enter the following URL in your browser:

http://localhost:8080

The nutch search page should appear.
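If the search page does not come up, or it comes up but returns no results, a couple of quick checks (a sketch; the paths assume the standard tomcat layout used above):

tail -50 /opt/tomcat/logs/catalina.out
ls /opt/tomcat/webapps/ROOT/WEB-INF/classes/

The first shows webapp deployment or configuration errors; the second should list the *-site.xml files copied above, and tomcat must be running as the nutch user as explained earlier.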
27) RECRAWLING

Now that everything works, we update our db with new URLs.

A) We create the fetchlist with the top 100 scoring pages in the current db:

bin/nutch generate test/crawldb test/segments -topN 100

This has generated the new segment test/segments/20060516135945.

B) Now we fetch the new pages:

bin/nutch fetch test/segments/20060516135945

C) The db is now updated with the entries for the new pages:

bin/nutch updatedb test/crawldb test/segments/20060516135945

D) Now we invert links. I guess I could have inverted links just on test/segments/20060516135945, but here I do it on all segments:

bin/nutch invertlinks linkdb -dir test/segments

E) Remove the test/indexes directory:

bin/hadoop dfs -rm test/indexes

F) Now we recreate the indexes:

bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525 test/segments/20060516135945

G) Dedup:

bin/nutch dedup test/indexes

H) Merge the indexes:

bin/nutch merge test/index test/indexes

I) Now, if you like, you can even remove test/indexes.

I have also tried to index each segment into its own indexes directory, like this:

bin/nutch index test/indexes1 test/crawldb linkdb test/segments/20060511101525
bin/nutch index test/indexes2 test/crawldb linkdb test/segments/20060516135945
bin/nutch merge test/index test/indexes1 test/indexes2

It looks like this works, and it avoids re-indexing every segment each time: we only index the new segment and then just regenerate the merged index.

Another solution for merging could have been to index each segment into a different index directory:

bin/nutch index test/indexe1 test/crawldb linkdb test/segments/20060511101525
bin/nutch index test/indexe2 test/crawldb linkdb test/segments/20060516135945
bin/nutch merge test/index test/indexe1 test/indexe2

Yet another solution is to merge the segments and index only the resulting merged segment, but so far I have not succeeded in doing that.

# nutch crawl dmoz/urls -dir crawl-tinysite -depth 10
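To tie steps A) through H) together, here is a rough sketch of how a recrawl could be scripted for repeated runs. It only restates the sequence above and is not a tested script; in particular, the way the newly generated segment is picked up (taking the newest entry under test/segments from the dfs listing) is an assumption to verify against your hadoop/nutch versions:

#!/bin/sh
# recrawl sketch: generate -> fetch -> updatedb -> invertlinks -> index -> dedup -> merge
NUTCH=/opt/nutch/bin/nutch
HADOOP=/opt/nutch/bin/hadoop

$NUTCH generate test/crawldb test/segments -topN 100

# assumption: the newest directory under test/segments is the one just generated
SEGMENT=`$HADOOP dfs -ls test/segments | grep test/segments/ | awk '{print $1}' | sort | tail -1`

$NUTCH fetch $SEGMENT
$NUTCH updatedb test/crawldb $SEGMENT
$NUTCH invertlinks linkdb -dir test/segments
$HADOOP dfs -rm test/indexes
$NUTCH index test/indexes test/crawldb linkdb $SEGMENT
$NUTCH dedup test/indexes
$NUTCH merge test/index test/indexes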