Earl Cahill wrote:
Any chance you could walk through your implementation?
Like how the twenty boxes were assigned? Maybe
upload your confs somewhere, and outline what commands
you actually ran?
All 20 boxes are configured identically, running Debian with a 2.4
Linux kernel.
These are dual-processor boxes with 2GB of RAM each. Each machine has
four drives, mounted as a RAID on /export/crawlspace. This cluster uses
NFS to mount home directories, so I did not have to set NUTCH_MASTER in
order to rsync copies of nutch to all machines.
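If home directories were not NFS-mounted, the alternative would be to
set NUTCH_MASTER in ~/.ssh/environment so the daemon scripts rsync the
nutch install out to the slaves at startup; the value below is only an
illustration of the idea, with a guessed host:path format:

NUTCH_MASTER=adminhost:/home/dcutting/src/nutch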
I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant, and
Subversion in ~/local/svn.
My ~/.ssh/environment contains:
JAVA_HOME=/home/dcutting/local/java
NUTCH_OPTS=-server
NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
NUTCH_SLAVES=/home/dcutting/.slaves
I added the following to ~/.bash_profile, then logged out & back in.
export `cat ~/.ssh/environment`
I added the following to /etc/ssh/sshd_config on all hosts:
PermitUserEnvironment yes
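For the PermitUserEnvironment change to take effect, sshd has to
re-read its config; on Debian that is something along the lines of
/etc/init.d/ssh restart, run as root on each host. Afterwards a quick
check like the following (slave hostname is just a placeholder) should
show the variables in a non-interactive ssh session:

ssh crawl001 env | grep -E 'JAVA_HOME|NUTCH'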
My ~/.slaves file contains a list of all 20 slave hosts, one per line.
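For example (hostnames here are made up):

crawl001
crawl002
...
crawl020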
My ~/src/nutch/conf/mapred-default.xml contains:
<nutch-conf>
<property>
<name>mapred.map.tasks</name>
<value>1000</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>39</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/nutch-site.xml contains:
<nutch-conf>
<property>
<name>fetcher.threads.fetch</name>
<value>100</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>parser.html.impl</name>
<value>tagsoup</value>
</property>
<!-- NDFS -->
<property>
<name>fs.default.name</name>
<value>adminhost:8009</value>
</property>
<property>
<name>ndfs.name.dir</name>
<value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
</property>
<property>
<name>ndfs.data.dir</name>
<value>/export/crawlspace/tmp/dcutting/ndfs</value>
</property>
<!-- MapReduce -->
<property>
<name>mapred.job.tracker</name>
<value>adminhost:8010</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/export/crawlspace/tmp/dcutting/local</value>
</property>
<property>
<name>mapred.child.heap.size</name>
<value>500m</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/crawl-urlfilter.txt contains:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept everything else
+.
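(For illustration: that repeated-segment rule would reject a looping
URL such as http://example.com/a/b/a/b/a/b/index.html, where /a occurs
three times, while the final +. rule accepts anything that survives
the excludes above.)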
To run the crawl I gave the following commands on the master host:
# checkout nutch sources and build them
mkdir ~/src
cd ~/src
~/local/svn/bin/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
~/local/ant/bin/ant
# install config files named above in ~/src/nutch/conf
# create dmoz/urls file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
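# optional sanity check: count how many URLs DmozParser extracted
wc -l dmoz/urls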
# create required directories on slaves
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
# start nutch daemons
bin/start-all.sh
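# optional: confirm the daemons came up; the per-slave log dirs
# should now contain logs, and the namenode should answer a listing
# (I believe the ndfs shell supports -ls, but treat that as an
# assumption)
bin/slaves.sh ls /export/crawlspace/tmp/dcutting/logs
bin/nutch ndfs -ls /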
# copy dmoz/urls into ndfs
bin/nutch ndfs -put dmoz dmoz
# crawl
nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 < /dev/null >& crawl.log &
Then I visited http://master:50030/ to monitor progress.
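If something looks stuck, the crawl log on the master and the daemon
logs under NUTCH_LOG_DIR on each slave are the places to look, for
example (exact log file names may differ):

tail -f crawl.log
bin/slaves.sh ls -lt /export/crawlspace/tmp/dcutting/logs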
I think that's it!
Doug