I have spent some time playing with nutch-0.8-dev and collecting notes from the
mailing lists...
Maybe someone will find these notes useful and could point out my mistakes.
I am not at all a nutch expert...
-Corrado
0) CREATE NUTCH USER AND GROUP
Create a nutch user and group and perform all the following logged in as
nutch user.
Put these lines in your .bash_profile:
export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
1) GET HADOOP and NUTCH
Download the nutch and hadoop trunks as explained on
http://lucene.apache.org/hadoop/version_control.html
(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)
2) BUILD HADOOP
Build hadoop and produce the tar file:
cd hadoop/trunk
ant tar
To build hadoop with 64-bit native libraries proceed as follows:
A) download and install the latest lzo library
(http://www.oberhumer.com/opensource/lzo/download/)
Note: the lzo packages currently available for fc5 are too old
tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install
B) compile native 64bit libs for hadoop if needed
cd hadoop/trunk/src/native
export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64
CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h
in config.h replace the line
#define HADOOP_LZO_LIBRARY libnotfound.so
with this one
#define HADOOP_LZO_LIBRARY "liblzo2.so"
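If you prefer to script that edit, something like this sed command should work
(assuming the line appears in config.h exactly as above):
sed -i 's/#define HADOOP_LZO_LIBRARY libnotfound.so/#define HADOOP_LZO_LIBRARY "liblzo2.so"/' config.h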
make
3) BUILD NUTCH
The nutch nightly trunk now comes with hadoop-0.12.jar, but maybe you want to
drop in the hadoop jar you just built:
mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar
4) INSTALL
Copy and untar the generated .tar.gz file on the machines that will
participate in the engine activities.
In my case I only have two identical machines available, called myhost1 and
myhost2.
On each of them I have installed the nutch binaries under /opt/nutch, while I
have decided to keep the hadoop distributed filesystem in a directory called
hadoopFs located on a large disk mounted on /disk10.
on both machines create the directory:
mkdir /disk10/hadoopFs/
copy hadoop 64bit native libraries if needed
mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64
5) CONFIG
I will use myhost1 as the master machine, running the namenode and jobtracker;
it will also run a datanode and a tasktracker.
myhost2 will only run a datanode and a tasktracker.
A) on both machines edit the conf/hadoop-site.xml configuration file.
Here are the values I have used:
fs.default.name : myhost1.mydomain.org:9010
mapred.job.tracker : myhost1.mydomain.org:9011
mapred.map.tasks : 40
mapred.reduce.tasks : 3
dfs.name.dir : /disk10/hadoopFs/name
dfs.data.dir : /disk10/hadoopFs/data
mapred.system.dir : /disk10/hadoopFs/mapreduce/system
mapred.local.dir : /disk10/hadoopFs/mapreduce/local
dfs.replication : 2
"The mapred.map.tasks property tells how many tasks you want to run in
parallel. This should be a multiple of the number of computers that you have.
In our case, since we are starting out with 2 computers, we will have 4 map
and 4 reduce tasks."
"The dfs.replication property states how many servers a single file should be
replicated to before it becomes available. Because we are using 2 servers I
have set this at 2."
Maybe you also want to change nutch-site.xml, setting http.redirect.max to a
value different from the default of 3:
http.redirect.max : 10
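As a sketch of the syntax, each of the values above goes into
conf/hadoop-site.xml as a <property> element (http.redirect.max goes into
conf/nutch-site.xml the same way), for example, with the hostnames used in
this setup:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>myhost1.mydomain.org:9010</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>myhost1.mydomain.org:9011</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- ...the remaining properties follow the same pattern... -->
</configuration>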
B) be sure that your conf/slaves file contains the names of the slave
machines. In my case:
myhost1.mydomain.org
myhost2.mydomain.org
C) create directories for pids and log files on both machines
mkdir /opt/nutch/pids
mkdir /opt/nutch/logs
D) on both machines edit the conf/hadoop-env.sh file to point to the right
java and nutch installations.
export HADOOP_HOME=/opt/nutch
export JAVA_HOME=/opt/jdk
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_PID_DIR=${HADOOP_HOME}/pids
E) Because of a problem with the classloader in nutch, the following lines
need to be added to the nutch bin/hadoop script before it starts building the
CLASSPATH variable:
for f in $HADOOP_HOME/nutch-*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
This will put the nutch-*.jar file into the CLASSPATH.
6) SSH SETUP ( Important!! )
Set up ssh as explained in http://wiki.apache.org/nutch/NutchHadoopTutorial
and test the ability to log in without a password from each machine to itself
and from myhost1 to myhost2 and vice versa.
This is a very important step to avoid connection refused problems between
daemons.
Here is a short example of how to proceed:
A) use ssh-keygen to create .ssh/id_dsa files :
ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nutch/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nutch/.ssh/id_dsa.
Your public key has been saved in /home/nutch/.ssh/id_dsa.pub.
The key fingerprint is:
01:36:6c:9d:27:09:54:e4:ff:fb:20:86:8c:e1:6c:82 [EMAIL PROTECTED]
B) copy .ssh/id_dsa.pub onto all machines as .ssh/authorized_keys (see the
sketch at the end of this section)
C) on each machine configure ssh-agent to start at login by adding a line to
.xsession, e.g.:
ssh-agent startkde
or put eval `ssh-agent` in .bashrc (this will start an ssh-agent for every
new shell)
D) Use ssh-add to add the dsa key
ssh-add
Enter passphrase for /home/nutch/.ssh/id_dsa:
Identity added: /home/nutch/.ssh/id_dsa (/home/nutch/.ssh/id_dsa)
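For step B, a minimal sketch of getting the key in place, assuming only the
two hosts used here, that ~/.ssh already exists on myhost2, and that you are
happy to share one key pair between the two machines:
# run on myhost1 as the nutch user
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# reuse the same key pair on myhost2 so password-less login works both ways
scp ~/.ssh/id_dsa ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys nutch@myhost2.mydomain.org:.ssh/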
7) FORMAT HADOOP FILESYSTEM
"Fix for HADOOP-19. A namenode must now be formatted before it may be used.
Attempts to start a namenode in an unformatted directory will fail, rather
than automatically creating a new, empty filesystem, causing existing
datanodes to delete all blocks. Thus a mis-configured dfs.data.dir should no
longer cause data loss."
on the master machine (myhost1) run these commands:
cd /opt/nutch/
bin/hadoop namenode -format
This will create the /disk10/hadoopFs/name/image directory
8) START NAMENODE
start the namenode on the master machine (myhost1)
bin/hadoop-daemon.sh start namenode
starting namenode, logging to
/opt/nutch/logs/hadoop-nutch-namenode-myhost1.mydomain.org.out
060509 150431 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150431 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150431 directing logs to directory /opt/nutch/logs
9) START DATANODES
start the datanode on the master and all slave machines (myhost1 and
myhost2)
on myhost1:
bin/hadoop-daemon.sh start datanode
starting datanode, logging to
/opt/nutch/logs/hadoop-nutch-datanode-myhost1.mydomain.org.out
060509 150619 0x0000000a parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150619 0x0000000a parsing
file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150619 0x0000000a directing logs to directory /opt/nutch/logs
on myhost2:
bin/hadoop-daemon.sh start datanode
starting datanode, logging to
/opt/nutch/logs/hadoop-nutch-datanode-myhost2.mydomain.org.out
060509 151517 0x0000000a parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 151517 0x0000000a parsing
file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 151517 0x0000000a directing logs to directory /opt/nutch/logs
10) START JOBTRACKER
start the jobtracker on the master machine (myhost1)
on myhost1
bin/hadoop-daemon.sh start jobtracker
starting jobtracker, logging to
/opt/nutch/logs/hadoop-nutch-jobtracker-myhost1.mydomain.org.out
060509 152020 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152021 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152021 directing logs to directory /opt/nutch/logs
11) START TASKTRACKERS
start the tasktracker on the slave machines (myhost1 and myhost2)
on myhost1:
bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to
/opt/nutch/logs/hadoop-nutch-tasktracker-myhost1.mydomain.org.out
060509 152236 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152236 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152236 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152236 directing logs to directory /opt/nutch/logs
on myhost2:
bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to
/opt/nutch/logs/hadoop-nutch-tasktracker-myhost2.mydomain.org.out
060509 152333 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152333 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152333 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152333 directing logs to directory /opt/nutch/logs
NOTE: Now that we have verified that the daemons start and connect properly,
we can start and stop all of them using the start-all.sh and stop-all.sh
scripts from the master machine.
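For example, from /opt/nutch on myhost1 (the scripts ssh to the hosts listed
in conf/slaves, which is why the password-less ssh setup above matters):
bin/start-all.sh
bin/stop-all.sh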
12) TEST FUNCTIONALITY
Test hadoop functionality ... just a simple ls
bin/hadoop dfs -ls
060509 152844 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060509 152845 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060509 152845 No FS indicated, using default:localhost:9010
060509 152845 Client connection to 127.0.0.1:9010: starting
Found 0 items
The dfs filesystem is empty... of course.
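As a slightly stronger test you can push a small local file into the dfs and
read it back; a sketch (the file name is arbitrary):
bin/hadoop dfs -put /etc/hosts testfile
bin/hadoop dfs -cat testfile
bin/hadoop dfs -rm testfile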
13) CREATE FILE FOR URL INJECTION
Now we need to create a crawldb and inject URLs into it. These initial URLs
will then be used for the initial crawl.
Let's inject URLs from the DMOZ Open Directory. First we must download and
uncompress the file listing all of the DMOZ pages.
(This is a compressed file of about 300MB, which uncompressed is about 2GB in
size, so this will take a few minutes.)
On the myhost1 machine, where we run the namenode:
cd /disk10
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
A) 5 million pages
DMOZ contains around 5 million URLs; to parse them all:
/opt/nutch-0.8-dev/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
060510 104615 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -2131431075
060510 104615 Begin parse
060510 104616 Client connection to myhost1:9010: starting
060510 105156 Completed parse. Found 4756391 pages.
B) as a second choice we can also select a random subset of these pages.
(We can use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.)
DMOZ contains around five million URLs. We select one out of every 100, so
that we end up with around 50,000 URLs:
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 100 > dmoz/urls
060510 104615 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -736060357
060510 104615 Begin parse
060510 104615 Client connection to myhost1:9010: starting
060510 104615 Completed parse. Found 49498 pages.
Here I go for choice B.
The parser also takes a few minutes, as it must parse the full 2GB file.
Finally, we copy the selected urls into the dfs:
bin/hadoop dfs -put /disk10/dmoz dmoz
060510 101321 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 101321 No FS indicated, using default:myhost1.mydomain.org:9010
060510 101321 Client connection to 10.234.57.38:9010: starting
060510 101321 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
bin/hadoop dfs -lsr dmoz
060510 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 134738 No FS indicated, using default:myhost1.mydomain.org:9010
060510 134738 Client connection to 10.234.57.38:9010: starting
/user/nutch/dmoz <dir>
/user/nutch/dmoz/urls <r 2> 57059180
14) CREATE CRAWLDB (INJECT URLs)
create a crawldb and inject the urls into the web database:
bin/nutch inject test/crawldb dmoz
060511 092330 Injector: starting
060511 092330 Injector: crawlDb: test/crawldb
060511 092330 Injector: urlDir: dmoz
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Injector: Converting injected urls to crawl db entries.
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Client connection to 10.234.57.38:9010: starting
060511 092330 Client connection to 10.234.57.38:9011: starting
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092332 Running job: job_0001
060511 092333 map 0% reduce 0%
060511 092342 map 25% reduce 0%
060511 092344 map 50% reduce 0%
060511 092354 map 75% reduce 0%
060511 092402 map 100% reduce 0%
060511 092412 map 100% reduce 25%
060511 092414 map 100% reduce 75%
060511 092422 map 100% reduce 100%
060511 092423 Job complete: job_0001
060511 092423 Injector: Merging injected urls into crawl db.
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092424 Running job: job_0002
060511 092425 map 0% reduce 0%
060511 092442 map 25% reduce 0%
060511 092444 map 50% reduce 0%
060511 092454 map 75% reduce 0%
060511 092502 map 100% reduce 0%
060511 092511 map 100% reduce 25%
060511 092513 map 100% reduce 75%
060511 092522 map 100% reduce 100%
060511 092523 Job complete: job_0002
060511 092523 Injector: done
This will create the test/crawldb folder in the dfs.
From the nutch tutorial:
"The crawl database, or crawldb. This contains information about every url
known to Nutch, including whether it was fetched, and, if so, when."
You can also see that the physical filesystem where we put the dfs has
changed as well: a few data block files have been created, on both the
myhost1 and myhost2 machines which participate in the dfs.
tree /disk10/hadoopFs
/disk10/hadoopFs
|-- data
| |-- data
| | |-- blk_-1388015236827939264
| | |-- blk_-2961663541591843930
| | |-- blk_-3901036791232325566
| | |-- blk_-5212946459038293740
| | |-- blk_-5301517582607663382
| | |-- blk_-7397383874477738842
| | |-- blk_-9055045635688102499
| | |-- blk_-9056717903919576858
| | |-- blk_1330666339588899715
| | |-- blk_1868647544763144796
| | |-- blk_3136516483028291673
| | |-- blk_4297959992285923734
| | |-- blk_5111098874834542511
| | |-- blk_5224195282207865093
| | |-- blk_5554003155307698150
| | |-- blk_7122181909600991812
| | |-- blk_8745902888438265091
| | `-- blk_883778723937265061
| `-- tmp
|-- mapreduce
`-- name
|-- edits
`-- image
`-- fsimage
To inspect the crawldb you can dump it into the dfs and copy the result out
locally:
bin/nutch readdb test/crawldb -dump tmp/crawldbDump1
bin/hadoop dfs -lsr
bin/hadoop dfs -get tmp/crawldbDump1 tmp/
15) CREATE FETCHLIST
To fetch, we first need to generate a fetchlist from the injected URLs in
the database.
This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created.
bin/nutch generate test/crawldb test/segments
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Generator: starting
060511 101525 Generator: segment: test/segments/20060511101525
060511 101525 Generator: Selecting most-linked urls due for fetch.
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Client connection to 10.234.57.38:9010: starting
060511 101525 Client connection to 10.234.57.38:9011: starting
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101527 Running job: job_0001
060511 101528 map 0% reduce 0%
060511 101546 map 50% reduce 0%
060511 101556 map 75% reduce 0%
060511 101606 map 100% reduce 0%
060511 101616 map 100% reduce 75%
060511 101626 map 100% reduce 100%
060511 101627 Job complete: job_0001
060511 101627 Generator: Partitioning selected urls by host, for politeness.
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101628 Running job: job_0002
060511 101629 map 0% reduce 0%
060511 101646 map 40% reduce 0%
060511 101656 map 60% reduce 0%
060511 101706 map 80% reduce 0%
060511 101717 map 100% reduce 0%
060511 101726 map 100% reduce 100%
060511 101727 Job complete: job_0002
060511 101727 Generator: done
At the end of this we will have the new fetchlist created in:
test/segments/20060511101525/crawl_generate/part-00000 <r 2> 777933
test/segments/20060511101525/crawl_generate/part-00001 <r 2> 751088
test/segments/20060511101525/crawl_generate/part-00002 <r 2> 988871
test/segments/20060511101525/crawl_generate/part-00003 <r 2> 833454
To inspect a generated fetchlist you can dump it with readseg (here shown on a
segment from a later run):
bin/nutch readseg -dump test/segments/20061027135841 test/segments/20061027135841/gendump \
  -nocontent -nofetch -noparse -noparsedata -noparsetext
16) FETCH
Now we run the fetcher on the created segment. This will load the web pages
into the segment.
bin/nutch fetch test/segments/20060511101525
060511 101820 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101820 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Fetcher: starting
060511 101821 Fetcher: segment: test/segments/20060511101525
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Client connection to 10.234.57.38:9011: starting
060511 101821 Client connection to 10.234.57.38:9010: starting
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101822 Running job: job_0003
060511 101823 map 0% reduce 0%
060511 110818 map 25% reduce 0%
060511 112428 map 50% reduce 0%
060511 122241 map 75% reduce 0%
060511 133613 map 100% reduce 0%
060511 133823 map 100% reduce 100%
060511 133824 Job complete: job_0003
060511 133824 Fetcher: done
17) UPDATE CRAWLDB
When the fetcher is complete, we update the database with the results of the
fetch. This will add database entries for all of the pages referenced by the
initial set from the dmoz file.
bin/nutch updatedb test/crawldb test/segments/20060511101525
060511 134940 CrawlDb update: starting
060511 134940 CrawlDb update: db: test/crawldb
060511 134940 CrawlDb update: segment: test/segments/20060511101525
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134940 Client connection to 10.234.57.38:9010: starting
060511 134940 CrawlDb update: Merging segment data into db.
060511 134940 Client connection to 10.234.57.38:9011: starting
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134941 Running job: job_0004
060511 134942 map 0% reduce 0%
060511 134954 map 17% reduce 0%
060511 135004 map 25% reduce 0%
060511 135013 map 33% reduce 0%
060511 135023 map 42% reduce 0%
060511 135024 map 50% reduce 0%
060511 135034 map 58% reduce 0%
060511 135044 map 67% reduce 0%
060511 135054 map 83% reduce 0%
060511 135104 map 92% reduce 0%
060511 135114 map 100% reduce 0%
060511 135124 map 100% reduce 100%
060511 135125 Job complete: job_0004
060511 135125 CrawlDb update: done
A) We can now see the crawl statistics:
bin/nutch readdb test/crawldb -stats
060511 135340 CrawlDb statistics start: test/crawldb
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135340 Client connection to 10.234.57.38:9010: starting
060511 135340 Client connection to 10.234.57.38:9011: starting
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135341 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135341 Running job: job_0005
060511 135342 map 0% reduce 0%
060511 135353 map 25% reduce 0%
060511 135354 map 50% reduce 0%
060511 135405 map 75% reduce 0%
060511 135414 map 100% reduce 0%
060511 135424 map 100% reduce 25%
060511 135425 map 100% reduce 50%
060511 135434 map 100% reduce 75%
060511 135444 map 100% reduce 100%
060511 135445 Job complete: job_0005
060511 135445 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135445 Statistics for CrawlDb: test/crawldb
060511 135445 TOTAL urls: 585055
060511 135445 avg score: 1.068
060511 135445 max score: 185.981
060511 135445 min score: 1.0
060511 135445 retry 0: 583943
060511 135445 retry 1: 1112
060511 135445 status 1 (DB_unfetched): 540202
060511 135445 status 2 (DB_fetched): 43086
060511 135445 status 3 (DB_gone): 1767
060511 135445 CrawlDb statistics: done
"I believe the retry numbers are the number of times page fetches failed
for recoverable errors and were re-processed before the page was fetched.
So most of the pages were fetched on the first try. Some encountered errors
and were fetched on the next try, and so on. The default setting is a maximum
of 3 retries, in the db.fetch.retry.max property."
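If you want to change that limit, the property can be overridden in
conf/nutch-site.xml like any other; a sketch (the description wording here is
just a reminder, not the official text):
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>How many times a url with recoverable fetch errors is retried
  before it is given up on.</description>
</property>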
B) We can now dump the crawl db to flat files in the dfs and get a copy out
to a local file:
bin/nutch readdb test/crawldb -dump mydump
060511 135603 CrawlDb dump: starting
060511 135603 CrawlDb db: test/crawldb
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135603 Client connection to 10.234.57.38:9010: starting
060511 135603 Client connection to 10.234.57.38:9011: starting
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135604 Running job: job_0006
060511 135605 map 0% reduce 0%
060511 135624 map 50% reduce 0%
060511 135634 map 75% reduce 0%
060511 135644 map 100% reduce 0%
060511 135654 map 100% reduce 25%
060511 135704 map 100% reduce 100%
060511 135705 Job complete: job_0006
060511 135705 CrawlDb dump: done
bin/hadoop dfs -lsr mydump
060511 135802 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135802 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135803 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135803 Client connection to 10.234.57.38:9010: starting
/user/nutch/mydump/part-00000 <r 2> 39031197
/user/nutch/mydump/part-00001 <r 2> 39186940
/user/nutch/mydump/part-00002 <r 2> 38954809
/user/nutch/mydump/part-00003 <r 2> 39171283
bin/hadoop dfs -get mydump/part-00000 mydumpFile
060511 135848 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135848 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135848 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135848 Client connection to 10.234.57.38:9010: starting
more mydumpFile
gopher://csf.Colorado.EDU/11/ipe/Thematic_Archive/newsletters/africa_information_afrique_net/Angola
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:38:09 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0666667
Signature: null
Metadata: null
gopher://gopher.gwdg.de/11/Uni/igdl Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:37:03 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0140845
Signature: null
Metadata: null
gopher://gopher.jer1.co.il:70/00/jorgs/npo/camera/media/1994/npr
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:36:48 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0105263
Signature: null
Metadata: null
...
...
...
18) INVERT LINKS
Before indexing we first invert all of the links, so that we may index
incoming anchor text with the pages.
We now need to generate a linkdb; that is done with all the segments in your
segments folder:
bin/nutch invertlinks linkdb test/segments/20060511101525
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: starting
060511 140228 LinkDb: linkdb: linkdb
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 Client connection to 10.234.57.38:9010: starting
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: adding segment: test/segments/20060511101525
060511 140228 Client connection to 10.234.57.38:9011: starting
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140229 Running job: job_0007
060511 140230 map 0% reduce 0%
060511 140255 map 50% reduce 0%
060511 140305 map 75% reduce 0%
060511 140314 map 100% reduce 0%
060511 140324 map 100% reduce 100%
060511 140325 Job complete: job_0007
060511 140325 LinkDb: done
23) INDEX SEGMENT
To index the segment we use the index command, as follows.
bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525
060515 134738 Indexer: starting
060515 134738 Indexer: linkdb: linkdb
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134738 Indexer: adding segment: test/segments/20060511101525
060515 134738 Client connection to 10.234.57.38:9010: starting
060515 134738 Client connection to 10.234.57.38:9011: starting
060515 134739 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134739 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134739 Running job: job_0006
060515 134741 map 0% reduce 0%
060515 134758 map 11% reduce 0%
060515 134808 map 18% reduce 0%
060515 134818 map 25% reduce 0%
060515 134827 map 38% reduce 2%
060515 134837 map 44% reduce 2%
060515 134847 map 50% reduce 9%
060515 134857 map 53% reduce 11%
060515 134908 map 59% reduce 13%
060515 134918 map 66% reduce 13%
060515 134928 map 71% reduce 13%
060515 134938 map 74% reduce 13%
060515 134948 map 88% reduce 16%
060515 134957 map 94% reduce 17%
060515 135007 map 100% reduce 22%
060515 135017 map 100% reduce 50%
060515 135028 map 100% reduce 78%
060515 135038 map 100% reduce 82%
060515 135048 map 100% reduce 87%
060515 135058 map 100% reduce 92%
060515 135108 map 100% reduce 97%
060515 135117 map 100% reduce 99%
060515 135118 map 100% reduce 100%
060515 135129 Job complete: job_0006
060515 135129 Indexer: done
24) Try searching the engine using nutch itself
Nutch looks for the index and segments subdirectories of the dfs in the
directory defined by the searcher.dir property.
Edit conf/nutch-site.xml and add the following lines:
<property>
<name>searcher.dir</name>
<value>test</value>
<description>
Path to root of crawl. This directory is searched (in order)
for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
This is where the searcher looks for its data, as explained in the
description.
Now run a search using nutch itself. Example:
/opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean developpement
26) Search the engine using the browser
To search you need tomcat installed, with the nutch war file deployed into
the tomcat servlet container.
I have built and installed tomcat under /opt/tomcat
Note: (important)
Something interesting to note about the distributed filesystem is that it is
user specific.
If you store a directory urls in the filesystem as the nutch user, it is
actually stored as /user/nutch/urls.
What this means for us is that the user that does the crawl and stores it in
the distributed filesystem must also be the user that starts the search, or
no results will come back.
You can try this yourself by logging in as a different user and running the
ls command.
It won't find the directories because it is looking under a different
directory, /user/username instead of /user/nutch.
As explained above, we need to run tomcat as the nutch user in order to be
sure to get search results. Be sure the nutch user has write permission to
the nutch logs directory and read permission on the rest of the installation:
login as root
chmod -R ugo+rx /opt/nutch
chmod -R ugo+rwx /opt/nutch/logs
export CATALINA_OPTS="-server -Xss256k -Xms768m -Xmx768m -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true"
rm -rf /opt/tomcat/webapps/ROOT*
cp /opt/nutch/nutch*.war /opt/tomcat/webapps/ROOT.war
/opt/tomcat/bin/startup.sh
This should create a new webapps/ROOT directory.
We now have to ensure that the webapp (tomcat) can find the index and
segments.
The tomcat webapp will use the nutch configuration files under
/opt/tomcat/webapps/ROOT/WEB-INF/classes; copy your modified nutch
configuration files from the nutch conf directory in there:
cp /opt/nutch/conf/hadoop-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
cp /opt/nutch/conf/hadoop-env.sh /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-env.sh
cp /opt/nutch/conf/nutch-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
Now restart tomcat and enter the following URL into your browser:
http://localhost:8080
The nutch search page should appear.
27) RECRAWLING
Now that everything works we update our db with new URLs.
A) we create the fetchlist with the top 100 scoring pages in the current db:
bin/nutch generate test/crawldb test/segments -topN 100
This generated the new segment test/segments/20060516135945
B) Now we fetch the new pages
bin/nutch fetch test/segments/20060516135945
C) Now we update the db with the entries of the new pages:
bin/nutch updatedb test/crawldb test/segments/20060516135945
D) Now we invert links. I guess I could have just inverted the links of
test/segments/20060516135945, but here I do it on all segments:
bin/nutch invertlinks linkdb -dir test/segments
E) Remove the test/indexes directory
hadoop dfs -rm test/indexes
F) Now we recreate the indexes:
bin/nutch index test/indexes test/crawldb linkdb \
  test/segments/20060511101525 test/segments/20060516135945
G) DEDUP
bin/nutch dedup test/indexes
H) Merge indexes
bin/nutch merge test/index test/indexes
I) Now if you would like you can even remove test/indexes
I have also tried to index each segment into a separate indexes directory,
like:
bin/nutch index test/indexes1 test/crawldb linkdb test/segments/20060511101525
bin/nutch index test/indexes2 test/crawldb linkdb test/segments/20060516135945
bin/nutch merge test/index test/indexes1 test/indexes2
It looks like this is working, and it avoids indexing every segment each
time; we instead index just the new segment and then only have to regenerate
the merged index.
Another solution for merging could have been to index each segment into a
different index directory:
nutch index indexe1 test/crawldb linkdb test/segments/20060511101525
nutch index indexe2 test/crawldb linkdb test/segments/20060516135945
nutch merge test/index test/indexe1 test/indexe2
Yet another solution is to merge the segments and index only the resulting
merged segment, but so far I didn't succeed in doing so.
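Putting the recrawl steps A-H together, here is a rough shell sketch of one
recrawl round. It assumes it is run from /opt/nutch as the nutch user, uses
the same paths and -topN as above, picks the new segment as the newest entry
of test/segments in the dfs listing, and omits all error handling:
#!/bin/bash
# one recrawl round: generate, fetch, updatedb, invertlinks, reindex, dedup, merge
bin/nutch generate test/crawldb test/segments -topN 100
# list all segments in the dfs; the newest one is the segment just generated
segments=`bin/hadoop dfs -ls test/segments | grep test/segments/ | awk '{print $1}'`
newest=`echo "$segments" | sort | tail -1`
bin/nutch fetch $newest
bin/nutch updatedb test/crawldb $newest
bin/nutch invertlinks linkdb -dir test/segments
# rebuild the indexes over all segments, then dedup and merge
# (if test/index is left over from a previous round you may need to remove it first)
bin/hadoop dfs -rm test/indexes
bin/nutch index test/indexes test/crawldb linkdb $segments
bin/nutch dedup test/indexes
bin/nutch merge test/index test/indexes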
Note: for completeness, a small initial crawl can also be done in one step
with the crawl command, e.g.:
nutch crawl dmoz/urls -dir crawl-tinysite -depth 10