Earl Cahill wrote:
Any chance you could walk through your implementation?
Like how the twenty boxes were assigned? Maybe
upload your confs somewhere, and outline what commands
you actually ran?
All 20 boxes are configured identically, running Debian with a 2.4
Linux kernel.
These are dual-processor boxes with 2GB of RAM each. Each machine has
four drives, mounted as a RAID on /export/crawlspace. This cluster uses
NFS to mount home directories, so I did not have to set NUTCH_MASTER in
order to rsync copies of nutch to all machines.
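If home directories were not NFS-mounted, the alternative would be to
set NUTCH_MASTER in ~/.ssh/environment so the daemon scripts rsync the
nutch install out to the slaves at startup; the value below is only an
illustration of the idea, with a guessed host:path format:

NUTCH_MASTER=adminhost:/home/dcutting/src/nutch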
I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant, and
Subversion in ~/local/svn.
My ~/.ssh/environment contains:
JAVA_HOME=/home/dcutting/local/java
NUTCH_OPTS=-server
NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
NUTCH_SLAVES=/home/dcutting/.slaves
I added the following to ~/.bash_profile, then logged out & back in.
export `cat ~/.ssh/environment`
I added the following to /etc/ssh/sshd_config on all hosts:
PermitUserEnvironment yes
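For the PermitUserEnvironment change to take effect, sshd has to
re-read its config; on Debian that is something along the lines of
/etc/init.d/ssh restart, run as root on each host. Afterwards a quick
check like the following (slave hostname is just a placeholder) should
show the variables in a non-interactive ssh session:

ssh crawl001 env | grep -E 'JAVA_HOME|NUTCH'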
My ~/.slaves file contains a list of all 20 slave hosts, one per line.
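For example (hostnames here are made up):

crawl001
crawl002
...
crawl020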
My ~/src/nutch/conf/mapred-default.xml contains:
<nutch-conf>
<property>
<name>mapred.map.tasks</name>
<value>1000</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>39</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/nutch-site.xml contains:
<nutch-conf>
<property>
<name>fetcher.threads.fetch</name>
<value>100</value>
</property>
<property>
<name>generate.max.per.host</name>
<value>100</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
<name>parser.html.impl</name>
<value>tagsoup</value>
</property>
<!-- NDFS -->
<property>
<name>fs.default.name</name>
<value>adminhost:8009</value>
</property>
<property>
<name>ndfs.name.dir</name>
<value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
</property>
<property>
<name>ndfs.data.dir</name>
<value>/export/crawlspace/tmp/dcutting/ndfs</value>
</property>
<!-- MapReduce -->
<property>
<name>mapred.job.tracker</name>
<value>adminhost:8010</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/export/crawlspace/tmp/dcutting/local</value>
</property>
<property>
<name>mapred.child.heap.size</name>
<value>500m</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/crawl-urlfilter.txt contains:
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept everything else
+.
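(For illustration: that repeated-segment rule would reject a looping
URL such as http://example.com/a/b/a/b/a/b/index.html, where /a occurs
three times, while the final +. rule accepts anything that survives
the excludes above.)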
To run the crawl I gave the following commands on the master host:
# checkout nutch sources and build them
mkdir ~/src
cd ~/src
~/local/svn/bin/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
~/local/ant/bin/ant
# install config files named above in ~/src/nutch/conf
# create dmoz/urls file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
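# optional sanity check: count how many URLs DmozParser extracted
wc -l dmoz/urls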
# create required directories on slaves
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
# start nutch daemons
bin/start-all.sh
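# optional: confirm the daemons came up; the per-slave log dirs
# should now contain logs, and the namenode should answer a listing
# (I believe the ndfs shell supports -ls, but treat that as an
# assumption)
bin/slaves.sh ls /export/crawlspace/tmp/dcutting/logs
bin/nutch ndfs -ls /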
# copy dmoz/urls into ndfs
bin/nutch ndfs -put dmoz dmoz
# crawl
nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 < /dev/null >& crawl.log &
Then I visited http://master:50030/ to monitor progress.
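If something looks stuck, the crawl log on the master and the daemon
logs under NUTCH_LOG_DIR on each slave are the places to look, for
example (exact log file names may differ):

tail -f crawl.log
bin/slaves.sh ls -lt /export/crawlspace/tmp/dcutting/logs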
I think that's it!
Doug