I have spent some time playing with nutch-0.8-dev and collecting notes from the
mailing lists...
Maybe someone will find these notes useful and could point out my mistakes.
I am not at all a nutch expert...
-Corrado
0) CREATE NUTCH USER AND GROUP
Create a nutch user and group and perform all the following logged in as
nutch user.
Put these lines in your .bash_profile:
export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
1) GET HADOOP and NUTCH
Download the nutch and hadoop trunks as explained on
http://lucene.apache.org/hadoop/version_control.html
(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)
2) BUILD HADOOP
Build hadoop and produce the tar file:
cd hadoop/trunk
ant tar
To build hadoop with 64-bit native libraries proceed as follows:
A) download and install the latest lzo library
(http://www.oberhumer.com/opensource/lzo/download/)
Note: the lzo packages currently available for fc5 are too old
tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install
B) compile native 64bit libs for hadoop if needed
cd hadoop/trunk/src/native
export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64
CCFLAGS="-I/opt/lzo-2.02/include" CPPFLAGS="-I/opt/lzo-2.02/include" ./configure
cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h
in config.h replace the line
#define HADOOP_LZO_LIBRARY libnotfound.so
with this one
#define HADOOP_LZO_LIBRARY "liblzo2.so"
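If you prefer to script that edit, something like this sed command should work
(assuming the line appears in config.h exactly as above):
sed -i 's/#define HADOOP_LZO_LIBRARY libnotfound.so/#define HADOOP_LZO_LIBRARY "liblzo2.so"/' config.h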
make
3) BUILD NUTCH
The nutch nightly trunk now comes with hadoop-0.12.jar, but maybe you want to
drop in the hadoop jar you just built:
mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar
4) INSTALL
Copy and untar the generated .tar.gz file on the machines that will
participate in the engine activities.
In my case I only have two identical machines available, called myhost1 and
myhost2.
On each of them I have installed the nutch binaries under /opt/nutch, while I
have decided to keep the hadoop distributed filesystem in a directory called
hadoopFs located on a large disk mounted on /disk10.
on both machines create the directory:
mkdir /disk10/hadoopFs/
copy hadoop 64bit native libraries if needed
mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64
5) CONFIG
I will use myhost1 as the master machine, running the namenode and jobtracker;
it will also run a datanode and a tasktracker.
myhost2 will only run a datanode and a tasktracker.
A) on both machines edit the conf/hadoop-site.xml configuration file.
Here are the values I have used:
fs.default.name : myhost1.mydomain.org:9010
mapred.job.tracker : myhost1.mydomain.org:9011
mapred.map.tasks : 40
mapred.reduce.tasks : 3
dfs.name.dir : /disk10/hadoopFs/name
dfs.data.dir : /disk10/hadoopFs/data
mapred.system.dir : /disk10/hadoopFs/mapreduce/system
mapred.local.dir : /disk10/hadoopFs/mapreduce/local
dfs.replication : 2
"The mapred.map.tasks property tells how many tasks you want to run in
parallel. This should be a multiple of the number of computers that you have.
In our case, since we are starting out with 2 computers, we will have 4 map
and 4 reduce tasks."
"The dfs.replication property states how many servers a single file should be
replicated to before it becomes available. Because we are using 2 servers I
have set this at 2."
Maybe you also want to change nutch-site.xml, setting http.redirect.max to a
value different from the default of 3:
http.redirect.max : 10
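As a sketch of the syntax, each of the values above goes into
conf/hadoop-site.xml as a <property> element (http.redirect.max goes into
conf/nutch-site.xml the same way), for example, with the hostnames used in
this setup:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>myhost1.mydomain.org:9010</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>myhost1.mydomain.org:9011</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- ...the remaining properties follow the same pattern... -->
</configuration>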
B) be sure that your conf/slaves file contains the names of the slave
machines. In my case:
myhost1.mydomain.org
myhost2.mydomain.org
C) create directories for pids and log files on both machines
mkdir /opt/nutch/pids
mkdir /opt/nutch/logs
D) on both machines edit the conf/hadoop-env.sh file to point to the right
java and nutch installations.
export HADOOP_HOME=/opt/nutch
export JAVA_HOME=/opt/jdk
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
export HADOOP_PID_DIR=${HADOOP_HOME}/pids
E) Because of a problem with the classloader in nutch, the following lines
need to be added to the nutch bin/hadoop script before it starts building the
CLASSPATH variable:
for f in $HADOOP_HOME/nutch-*.jar; do
CLASSPATH=${CLASSPATH}:$f;
done
This will put the nutch-*.jar file into the CLASSPATH.
6) SSH SETUP ( Important!! )
Set up ssh as explained in http://wiki.apache.org/nutch/NutchHadoopTutorial
and test the ability to log in without a password from each machine to itself
and from myhost1 to myhost2 and vice versa.
This is a very important step to avoid connection refused problems between
daemons.
Here is a short example of how to proceed:
A) use ssh-keygen to create .ssh/id_dsa files :
ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/nutch/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/nutch/.ssh/id_dsa.
Your public key has been saved in /home/nutch/.ssh/id_dsa.pub.
The key fingerprint is:
01:36:6c:9d:27:09:54:e4:ff:fb:20:86:8c:e1:6c:82 [EMAIL PROTECTED]
B) copy .ssh/id_dsa.pub onto all machines as .ssh/authorized_keys (see the
sketch at the end of this section)
C) on each machine configure ssh-agent to start at login by adding a line to
.xsession, e.g.:
ssh-agent startkde
or put eval `ssh-agent` in .bashrc (this will start an ssh-agent for every
new shell)
D) Use ssh-add to add the dsa key
ssh-add
Enter passphrase for /home/nutch/.ssh/id_dsa:
Identity added: /home/nutch/.ssh/id_dsa (/home/nutch/.ssh/id_dsa)
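For step B, a minimal sketch of getting the key in place, assuming only the
two hosts used here, that ~/.ssh already exists on myhost2, and that you are
happy to share one key pair between the two machines:
# run on myhost1 as the nutch user
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# reuse the same key pair on myhost2 so password-less login works both ways
scp ~/.ssh/id_dsa ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys nutch@myhost2.mydomain.org:.ssh/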
7) FORMAT HADOOP FILESYSTEM
"Fix for HADOOP-19. A namenode must now be formatted before it may be used.
Attempts to start a namenode in an unformatted directory will fail, rather
than automatically creating a new, empty filesystem, causing existing
datanodes to delete all blocks. Thus a mis-configured dfs.data.dir should no
longer cause data loss."
on the master machine (myhost1) run these commands:
cd /opt/nutch/
bin/hadoop namenode -format
This will create the /disk10/hadoopFs/name/image directory
8) START NAMENODE
start the namenode on the master machine (myhost1)
bin/hadoop-daemon.sh start namenode
starting namenode, logging to
/opt/nutch/logs/hadoop-nutch-namenode-myhost1.mydomain.org.out
060509 150431 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150431 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150431 directing logs to directory /opt/nutch/logs
9) START DATANODES
start the datanode on the master and all slave machines (myhost1 and
myhost2)
on myhost1:
bin/hadoop-daemon.sh start datanode
starting datanode, logging to
/opt/nutch/logs/hadoop-nutch-datanode-myhost1.mydomain.org.out
060509 150619 0x0000000a parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 150619 0x0000000a parsing
file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 150619 0x0000000a directing logs to directory /opt/nutch/logs
on myhost2:
bin/hadoop-daemon.sh start datanode
starting datanode, logging to
/opt/nutch/logs/hadoop-nutch-datanode-myhost2.mydomain.org.out
060509 151517 0x0000000a parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 151517 0x0000000a parsing
file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 151517 0x0000000a directing logs to directory /opt/nutch/logs
10) START JOBTRACKER
start the jobtracker on the master machine (myhost1)
on myhost1
bin/hadoop-daemon.sh start jobtracker
starting jobtracker, logging to
/opt/nutch/logs/hadoop-nutch-jobtracker-myhost1.mydomain.org.out
060509 152020 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152021 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152021 directing logs to directory /opt/nutch/logs
11) START TASKTRACKERS
start the tasktracker on the slave machines (myhost1 and myhost2)
on myhost1:
bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to
/opt/nutch/logs/hadoop-nutch-tasktracker-myhost1.mydomain.org.out
060509 152236 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152236 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152236 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152236 directing logs to directory /opt/nutch/logs
on myhost2:
bin/hadoop-daemon.sh start tasktracker
starting tasktracker, logging to
/opt/nutch/logs/hadoop-nutch-tasktracker-myhost2.mydomain.org.out
060509 152333 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060509 152333 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060509 152333 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060509 152333 directing logs to directory /opt/nutch/logs
NOTE: Now that we have verified that the daemons start and connect properly,
we can start and stop all of them using the start-all.sh and stop-all.sh
scripts from the master machine.
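For example, from /opt/nutch on myhost1 (the scripts ssh to the hosts listed
in conf/slaves, which is why the password-less ssh setup above matters):
bin/start-all.sh
bin/stop-all.sh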
12) TEST FUNCTIONALITY
Test hadoop functionality ... just a simple ls
bin/hadoop dfs -ls
060509 152844 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060509 152845 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060509 152845 No FS indicated, using default:localhost:9010
060509 152845 Client connection to 127.0.0.1:9010: starting
Found 0 items
The dfs filesystem is empty... of course.
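As a slightly stronger test you can push a small local file into the dfs and
read it back; a sketch (the file name is arbitrary):
bin/hadoop dfs -put /etc/hosts testfile
bin/hadoop dfs -cat testfile
bin/hadoop dfs -rm testfile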
13) CREATE FILE FOR URL INJECTION
Now we need to create a crawldb and inject URLs into it. These initial URLs
will then be used for the initial crawl.
Let's inject URLs from the DMOZ Open Directory. First we must download and
uncompress the file listing all of the DMOZ pages.
(This is a compressed file of about 300MB, which uncompressed is about 2GB in
size, so this will take a few minutes.)
On the myhost1 machine, where we run the namenode:
cd /disk10
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
A) 5 million pages
DMOZ contains around 5 million URLs; to parse them all:
/opt/nutch-0.8-dev/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
060510 104615 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -2131431075
060510 104615 Begin parse
060510 104616 Client connection to myhost1:9010: starting
060510 105156 Completed parse. Found 4756391 pages.
B) as a second choice we can also select a random subset of these pages.
(We can use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.)
DMOZ contains around five million URLs. We select one out of every 100, so
that we end up with around 50,000 URLs:
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 100 > dmoz/urls
060510 104615 parsing
jar:file:/home/opt/nutch-0.8-dev/lib/hadoop-0.2-dev.jar!/hadoop-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-default.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/nutch-site.xml
060510 104615 parsing file:/home/opt/nutch-0.8-dev/conf/hadoop-site.xml
060510 104615 skew = -736060357
060510 104615 Begin parse
060510 104615 Client connection to myhost1:9010: starting
060510 104615 Completed parse. Found 49498 pages.
Here I go for choice B.
The parser also takes a few minutes, as it must parse the full 2GB file.
Finally, we copy the selected urls into the dfs:
bin/hadoop dfs -put /disk10/dmoz dmoz
060510 101321 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 101321 No FS indicated, using default:myhost1.mydomain.org:9010
060510 101321 Client connection to 10.234.57.38:9010: starting
060510 101321 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 101321 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
bin/hadoop dfs -lsr dmoz
060510 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060510 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060510 134738 No FS indicated, using default:myhost1.mydomain.org:9010
060510 134738 Client connection to 10.234.57.38:9010: starting
/user/nutch/dmoz <dir>
/user/nutch/dmoz/urls <r 2> 57059180
14) CREATE CRAWLDB (INJECT URLs)
create a crawldb and inject the urls into the web database:
bin/nutch inject test/crawldb dmoz
060511 092330 Injector: starting
060511 092330 Injector: crawlDb: test/crawldb
060511 092330 Injector: urlDir: dmoz
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Injector: Converting injected urls to crawl db entries.
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092330 Client connection to 10.234.57.38:9010: starting
060511 092330 Client connection to 10.234.57.38:9011: starting
060511 092330 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092330 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092332 Running job: job_0001
060511 092333 map 0% reduce 0%
060511 092342 map 25% reduce 0%
060511 092344 map 50% reduce 0%
060511 092354 map 75% reduce 0%
060511 092402 map 100% reduce 0%
060511 092412 map 100% reduce 25%
060511 092414 map 100% reduce 75%
060511 092422 map 100% reduce 100%
060511 092423 Job complete: job_0001
060511 092423 Injector: Merging injected urls into crawl db.
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 092423 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 092424 Running job: job_0002
060511 092425 map 0% reduce 0%
060511 092442 map 25% reduce 0%
060511 092444 map 50% reduce 0%
060511 092454 map 75% reduce 0%
060511 092502 map 100% reduce 0%
060511 092511 map 100% reduce 25%
060511 092513 map 100% reduce 75%
060511 092522 map 100% reduce 100%
060511 092523 Job complete: job_0002
060511 092523 Injector: done
This will create the test/crawldb folder in the dfs.
From the nutch tutorial:
"The crawl database, or crawldb. This contains information about every url
known to Nutch, including whether it was fetched, and, if so, when."
You can also see that the physical filesystem where we put the dfs has
changed as well: a few data block files have been created, on both the
myhost1 and myhost2 machines which participate in the dfs.
tree /disk10/hadoopFs
/disk10/hadoopFs
|-- data
| |-- data
| | |-- blk_-1388015236827939264
| | |-- blk_-2961663541591843930
| | |-- blk_-3901036791232325566
| | |-- blk_-5212946459038293740
| | |-- blk_-5301517582607663382
| | |-- blk_-7397383874477738842
| | |-- blk_-9055045635688102499
| | |-- blk_-9056717903919576858
| | |-- blk_1330666339588899715
| | |-- blk_1868647544763144796
| | |-- blk_3136516483028291673
| | |-- blk_4297959992285923734
| | |-- blk_5111098874834542511
| | |-- blk_5224195282207865093
| | |-- blk_5554003155307698150
| | |-- blk_7122181909600991812
| | |-- blk_8745902888438265091
| | `-- blk_883778723937265061
| `-- tmp
|-- mapreduce
`-- name
|-- edits
`-- image
`-- fsimage
To inspect the crawldb you can dump it into the dfs and copy the result out
locally:
bin/nutch readdb test/crawldb -dump tmp/crawldbDump1
bin/hadoop dfs -lsr
bin/hadoop dfs -get tmp/crawldbDump1 tmp/
15) CREATE FETCHLIST
To fetch, we first need to generate a fetchlist from the injected URLs in
the database.
This generates a fetchlist for all of the pages due to be fetched.
The fetchlist is placed in a newly created segment directory.
The segment directory is named by the time it's created.
bin/nutch generate test/crawldb test/segments
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Generator: starting
060511 101525 Generator: segment: test/segments/20060511101525
060511 101525 Generator: Selecting most-linked urls due for fetch.
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101525 Client connection to 10.234.57.38:9010: starting
060511 101525 Client connection to 10.234.57.38:9011: starting
060511 101525 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101525 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101527 Running job: job_0001
060511 101528 map 0% reduce 0%
060511 101546 map 50% reduce 0%
060511 101556 map 75% reduce 0%
060511 101606 map 100% reduce 0%
060511 101616 map 100% reduce 75%
060511 101626 map 100% reduce 100%
060511 101627 Job complete: job_0001
060511 101627 Generator: Partitioning selected urls by host, for politeness.
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101627 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101628 Running job: job_0002
060511 101629 map 0% reduce 0%
060511 101646 map 40% reduce 0%
060511 101656 map 60% reduce 0%
060511 101706 map 80% reduce 0%
060511 101717 map 100% reduce 0%
060511 101726 map 100% reduce 100%
060511 101727 Job complete: job_0002
060511 101727 Generator: done
At the end of this we will have the new fetchlist created in:
test/segments/20060511101525/crawl_generate/part-00000 <r 2> 777933
test/segments/20060511101525/crawl_generate/part-00001 <r 2> 751088
test/segments/20060511101525/crawl_generate/part-00002 <r 2> 988871
test/segments/20060511101525/crawl_generate/part-00003 <r 2> 833454
To inspect a generated fetchlist you can dump it with readseg (here shown on a
segment from a later run):
bin/nutch readseg -dump test/segments/20061027135841 test/segments/20061027135841/gendump \
  -nocontent -nofetch -noparse -noparsedata -noparsetext
16) FETCH
Now we run the fetcher on the created segment. This will load the web pages
into the segment.
bin/nutch fetch test/segments/20060511101525
060511 101820 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101820 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Fetcher: starting
060511 101821 Fetcher: segment: test/segments/20060511101525
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101821 Client connection to 10.234.57.38:9011: starting
060511 101821 Client connection to 10.234.57.38:9010: starting
060511 101821 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 101821 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 101822 Running job: job_0003
060511 101823 map 0% reduce 0%
060511 110818 map 25% reduce 0%
060511 112428 map 50% reduce 0%
060511 122241 map 75% reduce 0%
060511 133613 map 100% reduce 0%
060511 133823 map 100% reduce 100%
060511 133824 Job complete: job_0003
060511 133824 Fetcher: done
17) UPDATE CRAWLDB
When the fetcher is complete, we update the database with the results of the
fetch. This will add database entries for all of the pages referenced by the
initial set from the dmoz file.
bin/nutch updatedb test/crawldb test/segments/20060511101525
060511 134940 CrawlDb update: starting
060511 134940 CrawlDb update: db: test/crawldb
060511 134940 CrawlDb update: segment: test/segments/20060511101525
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134940 Client connection to 10.234.57.38:9010: starting
060511 134940 CrawlDb update: Merging segment data into db.
060511 134940 Client connection to 10.234.57.38:9011: starting
060511 134940 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 134940 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 134941 Running job: job_0004
060511 134942 map 0% reduce 0%
060511 134954 map 17% reduce 0%
060511 135004 map 25% reduce 0%
060511 135013 map 33% reduce 0%
060511 135023 map 42% reduce 0%
060511 135024 map 50% reduce 0%
060511 135034 map 58% reduce 0%
060511 135044 map 67% reduce 0%
060511 135054 map 83% reduce 0%
060511 135104 map 92% reduce 0%
060511 135114 map 100% reduce 0%
060511 135124 map 100% reduce 100%
060511 135125 Job complete: job_0004
060511 135125 CrawlDb update: done
A) We can now see the crawl statistics:
bin/nutch readdb test/crawldb -stats
060511 135340 CrawlDb statistics start: test/crawldb
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135340 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135340 Client connection to 10.234.57.38:9010: starting
060511 135340 Client connection to 10.234.57.38:9011: starting
060511 135340 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135341 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135341 Running job: job_0005
060511 135342 map 0% reduce 0%
060511 135353 map 25% reduce 0%
060511 135354 map 50% reduce 0%
060511 135405 map 75% reduce 0%
060511 135414 map 100% reduce 0%
060511 135424 map 100% reduce 25%
060511 135425 map 100% reduce 50%
060511 135434 map 100% reduce 75%
060511 135444 map 100% reduce 100%
060511 135445 Job complete: job_0005
060511 135445 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135445 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135445 Statistics for CrawlDb: test/crawldb
060511 135445 TOTAL urls: 585055
060511 135445 avg score: 1.068
060511 135445 max score: 185.981
060511 135445 min score: 1.0
060511 135445 retry 0: 583943
060511 135445 retry 1: 1112
060511 135445 status 1 (DB_unfetched): 540202
060511 135445 status 2 (DB_fetched): 43086
060511 135445 status 3 (DB_gone): 1767
060511 135445 CrawlDb statistics: done
"I believe the retry numbers are the number of times page fetches failed
for recoverable errors and were re-processed before the page was fetched.
So most of the pages were fetched on the first try. Some encountered errors
and were fetched on the next try, and so on. The default setting is a maximum
of 3 retries, in the db.fetch.retry.max property."
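If you want to change that limit, the property can be overridden in
conf/nutch-site.xml like any other; a sketch (the description wording here is
just a reminder, not the official text):
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>How many times a url with recoverable fetch errors is retried
  before it is given up on.</description>
</property>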
B) We can now dump the crawl db to flat files in the dfs and get a copy out
to a local file:
bin/nutch readdb test/crawldb -dump mydump
060511 135603 CrawlDb dump: starting
060511 135603 CrawlDb db: test/crawldb
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135603 Client connection to 10.234.57.38:9010: starting
060511 135603 Client connection to 10.234.57.38:9011: starting
060511 135603 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135603 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135604 Running job: job_0006
060511 135605 map 0% reduce 0%
060511 135624 map 50% reduce 0%
060511 135634 map 75% reduce 0%
060511 135644 map 100% reduce 0%
060511 135654 map 100% reduce 25%
060511 135704 map 100% reduce 100%
060511 135705 Job complete: job_0006
060511 135705 CrawlDb dump: done
bin/hadoop dfs -lsr mydump
060511 135802 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135802 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135803 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135803 Client connection to 10.234.57.38:9010: starting
/user/nutch/mydump/part-00000 <r 2> 39031197
/user/nutch/mydump/part-00001 <r 2> 39186940
/user/nutch/mydump/part-00002 <r 2> 38954809
/user/nutch/mydump/part-00003 <r 2> 39171283
bin/hadoop dfs -get mydump/part-00000 mydumpFile
060511 135848 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 135848 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 135848 No FS indicated, using default:myhost1.mydomain.org:9010
060511 135848 Client connection to 10.234.57.38:9010: starting
more mydumpFile
gopher://csf.Colorado.EDU/11/ipe/Thematic_Archive/newsletters/africa_information_afrique_net/Angola
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:38:09 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0666667
Signature: null
Metadata: null
gopher://gopher.gwdg.de/11/Uni/igdl Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:37:03 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0140845
Signature: null
Metadata: null
gopher://gopher.jer1.co.il:70/00/jorgs/npo/camera/media/1994/npr
Version: 4
Status: 1 (DB_unfetched)
Fetch time: Thu May 11 13:36:48 CEST 2006
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0105263
Signature: null
Metadata: null
...
...
...
18) INVERT LINKS
Before indexing we first invert all of the links, so that we may index
incoming anchor text with the pages.
We now need to generate a linkdb; that is done with all the segments in your
segments folder:
bin/nutch invertlinks linkdb test/segments/20060511101525
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: starting
060511 140228 LinkDb: linkdb: linkdb
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 Client connection to 10.234.57.38:9010: starting
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/mapred-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140228 LinkDb: adding segment: test/segments/20060511101525
060511 140228 Client connection to 10.234.57.38:9011: starting
060511 140228 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.0.jar!/hadoop-default.xml
060511 140228 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060511 140229 Running job: job_0007
060511 140230 map 0% reduce 0%
060511 140255 map 50% reduce 0%
060511 140305 map 75% reduce 0%
060511 140314 map 100% reduce 0%
060511 140324 map 100% reduce 100%
060511 140325 Job complete: job_0007
060511 140325 LinkDb: done
23) INDEX SEGMENT
To index the segment we use the index command, as follows.
bin/nutch index test/indexes test/crawldb linkdb test/segments/20060511101525
060515 134738 Indexer: starting
060515 134738 Indexer: linkdb: linkdb
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-default.xml
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/mapred-default.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/nutch-site.xml
060515 134738 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134738 Indexer: adding segment: test/segments/20060511101525
060515 134738 Client connection to 10.234.57.38:9010: starting
060515 134738 Client connection to 10.234.57.38:9011: starting
060515 134739 parsing
jar:file:/disk10/nutch-0.8-dev/lib/hadoop-0.2.1.jar!/hadoop-default.xml
060515 134739 parsing file:/disk10/nutch-0.8-dev/conf/hadoop-site.xml
060515 134739 Running job: job_0006
060515 134741 map 0% reduce 0%
060515 134758 map 11% reduce 0%
060515 134808 map 18% reduce 0%
060515 134818 map 25% reduce 0%
060515 134827 map 38% reduce 2%
060515 134837 map 44% reduce 2%
060515 134847 map 50% reduce 9%
060515 134857 map 53% reduce 11%
060515 134908 map 59% reduce 13%
060515 134918 map 66% reduce 13%
060515 134928 map 71% reduce 13%
060515 134938 map 74% reduce 13%
060515 134948 map 88% reduce 16%
060515 134957 map 94% reduce 17%
060515 135007 map 100% reduce 22%
060515 135017 map 100% reduce 50%
060515 135028 map 100% reduce 78%
060515 135038 map 100% reduce 82%
060515 135048 map 100% reduce 87%
060515 135058 map 100% reduce 92%
060515 135108 map 100% reduce 97%
060515 135117 map 100% reduce 99%
060515 135118 map 100% reduce 100%
060515 135129 Job complete: job_0006
060515 135129 Indexer: done
24) Try searching the engine using nutch itself
Nutch looks for the index and segments subdirectories of the dfs in the
directory defined by the searcher.dir property.
Edit conf/nutch-site.xml and add the following lines:
<property>
<name>searcher.dir</name>
<value>test</value>
<description>
Path to root of crawl. This directory is searched (in order)
for either the file search-servers.txt, containing a list of
distributed search servers, or the directory "index" containing
merged indexes, or the directory "segments" containing segment
indexes.
</description>
</property>
This is where the searcher looks for its data, as explained in the
description.
Now run a search using nutch itself. Example:
/opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean developpement
26) Search the engine using the browser
To search you need tomcat installed, with the nutch war file deployed into
the tomcat servlet container.
I have built and installed tomcat under /opt/tomcat
Note: (important)
Something interesting to note about the distributed filesystem is that it is
user specific.
If you store a directory urls in the filesystem as the nutch user, it is
actually stored as /user/nutch/urls.
What this means for us is that the user that does the crawl and stores it in
the distributed filesystem must also be the user that starts the search, or
no results will come back.
You can try this yourself by logging in as a different user and running the
ls command.
It won't find the directories because it is looking under a different
directory, /user/username instead of /user/nutch.
As explained above, we need to run tomcat as the nutch user in order to be
sure to get search results. Be sure the nutch user has write permission to
the nutch logs directory and read permission on the rest of the installation:
login as root
chmod -R ugo+rx /opt/nutch
chmod -R ugo+rwx /opt/nutch/logs
export CATALINA_OPTS="-server -Xss256k -Xms768m -Xmx768m -Djava.net.preferIPv4Stack=true -Djava.awt.headless=true"
rm -rf /opt/tomcat/webapps/ROOT*
cp /opt/nutch/nutch*.war /opt/tomcat/webapps/ROOT.war
/opt/tomcat/bin/startup.sh
This should create a new webapps/ROOT directory.
We now have to ensure that the webapp (tomcat) can find the index and
segments.
The tomcat webapp will use the nutch configuration files under
/opt/tomcat/webapps/ROOT/WEB-INF/classes; copy your modified nutch
configuration files from the nutch conf directory in there:
cp /opt/nutch/conf/hadoop-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-site.xml
cp /opt/nutch/conf/hadoop-env.sh /opt/tomcat/webapps/ROOT/WEB-INF/classes/hadoop-env.sh
cp /opt/nutch/conf/nutch-site.xml /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
Now restart tomcat and enter the following URL into your browser:
http://localhost:8080
The nutch search page should appear.
27) RECRAWLING
Now that everything works we update our db with new URLs.
A) we create the fetchlist with the top 100 scoring pages in the current db:
bin/nutch generate test/crawldb test/segments -topN 100
This generated the new segment test/segments/20060516135945
B) Now we fetch the new pages
bin/nutch fetch test/segments/20060516135945
C) Now we update the db with the entries of the new pages:
bin/nutch updatedb test/crawldb test/segments/20060516135945
D) Now we invert links. I guess I could have just inverted the links of
test/segments/20060516135945, but here I do it on all segments:
bin/nutch invertlinks linkdb -dir test/segments
E) Remove the test/indexes directory
hadoop dfs -rm test/indexes
F) Now we recreate the indexes:
bin/nutch index test/indexes test/crawldb linkdb \
  test/segments/20060511101525 test/segments/20060516135945
G) DEDUP
bin/nutch dedup test/indexes
H) Merge indexes
bin/nutch merge test/index test/indexes
I) Now if you would like you can even remove test/indexes
I have also tried to index each segment into a separate indexes directory,
like:
bin/nutch index test/indexes1 test/crawldb linkdb test/segments/20060511101525
bin/nutch index test/indexes2 test/crawldb linkdb test/segments/20060516135945
bin/nutch merge test/index test/indexes1 test/indexes2
It looks like this is working, and it avoids indexing every segment each
time; we instead index just the new segment and then only have to regenerate
the merged index.
Another solution for merging could have been to index each segment into a
different index directory:
nutch index indexe1 test/crawldb linkdb test/segments/20060511101525
nutch index indexe2 test/crawldb linkdb test/segments/20060516135945
nutch merge test/index test/indexe1 test/indexe2
Yet another solution is to merge the segments and index only the resulting
merged segment, but so far I didn't succeed in doing so.
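Putting the recrawl steps A-H together, here is a rough shell sketch of one
recrawl round. It assumes it is run from /opt/nutch as the nutch user, uses
the same paths and -topN as above, picks the new segment as the newest entry
of test/segments in the dfs listing, and omits all error handling:
#!/bin/bash
# one recrawl round: generate, fetch, updatedb, invertlinks, reindex, dedup, merge
bin/nutch generate test/crawldb test/segments -topN 100
# list all segments in the dfs; the newest one is the segment just generated
segments=`bin/hadoop dfs -ls test/segments | grep test/segments/ | awk '{print $1}'`
newest=`echo "$segments" | sort | tail -1`
bin/nutch fetch $newest
bin/nutch updatedb test/crawldb $newest
bin/nutch invertlinks linkdb -dir test/segments
# rebuild the indexes over all segments, then dedup and merge
# (if test/index is left over from a previous round you may need to remove it first)
bin/hadoop dfs -rm test/indexes
bin/nutch index test/indexes test/crawldb linkdb $segments
bin/nutch dedup test/indexes
bin/nutch merge test/index test/indexes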
Note: for completeness, a small initial crawl can also be done in one step
with the crawl command, e.g.:
nutch crawl dmoz/urls -dir crawl-tinysite -depth 10