Hi Doug,

Thanks a lot for taking the time to write such a detailed and informative
reply. I just wanted to confirm: did you run this distributed crawl with
Nutch version 0.7.1 or some other version? And was it a fully working
distributed crawl using MapReduce, or did it rely on some workaround for
distributing the crawl?

Thanks and Regards,
Pushpesh


On 1/5/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Earl Cahill wrote:
> > Any chance you could walk through your implementation?
> >  Like how the twenty boxes were assigned?  Maybe
> > upload your confs somewhere, and outline what commands
> > you actually ran?
>
> All 20 boxes are configured identically, running Debian with a 2.4 kernel.
> These are dual-processor boxes with 2GB of RAM each.  Each machine has
> four drives, mounted as a RAID on /export/crawlspace.  This cluster uses
> NFS to mount home directories, so I did not have to set NUTCH_MASTER in
> order to rsync copies of nutch to all machines.
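>
> (Aside: without NFS-mounted home directories, my understanding is that
> setting NUTCH_MASTER to a host:path in ~/.ssh/environment tells the slave
> scripts where to rsync the Nutch install from; something like the
> following, with the host and path here only illustrative:
>
> NUTCH_MASTER=adminhost:/home/dcutting/src/nutch
>
> Since this cluster has NFS, that step was unnecessary.)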
>
> I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and subversion
> in ~/local/svn.
>
> My ~/.ssh/environment contains:
>
> JAVA_HOME=/home/dcutting/local/java
> NUTCH_OPTS=-server
> NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
> NUTCH_SLAVES=/home/dcutting/.slaves
>
> I added the following to ~/.bash_profile, then logged out & back in.
>
> export `cat ~/.ssh/environment`
>
> I added the following to /etc/ssh/sshd_config on all hosts:
>
> PermitUserEnvironment yes
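>
> (Note: sshd only reads sshd_config at startup, so PermitUserEnvironment
> takes effect only after restarting sshd on each host, e.g. with
> "/etc/init.d/ssh restart" on Debian.)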
>
> My ~/.slaves file contains a list of all 20 slave hosts, one per line.
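>
> (It is a plain text file; with made-up hostnames it would look something
> like this:
>
> crawl01
> crawl02
> ...
> crawl20
>
> where crawl01 through crawl20 are placeholders for the real slave names.)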
>
> My ~/src/nutch/conf/mapred-default.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>1000</value>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>39</value>
> </property>
>
> </nutch-conf>
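>
> (The task counts appear to be sized to the cluster: 20 dual-processor
> boxes give roughly 20 x 2 = 40 CPUs, so 39 reduce tasks can all run in a
> single wave, while the 1000 map tasks just split the input finely enough
> that a slow or failed map is cheap to redo.)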
>
> My ~/src/nutch/conf/nutch-site.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>generate.max.per.host</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
> </property>
>
> <property>
>   <name>parser.html.impl</name>
>   <value>tagsoup</value>
> </property>
>
> <!-- NDFS -->
>
> <property>
>   <name>fs.default.name</name>
>   <value>adminhost:8009</value>
> </property>
>
> <property>
>   <name>ndfs.name.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
> </property>
>
> <property>
>   <name>ndfs.data.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs</value>
> </property>
>
> <!-- MapReduce -->
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>adminhost:8010</value>
> </property>
>
> <property>
>   <name>mapred.system.dir</name>
>   <value>/mapred/system</value>
> </property>
>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/local</value>
> </property>
>
> <property>
>   <name>mapred.child.heap.size</name>
>   <value>500m</value>
> </property>
>
> </nutch-conf>
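>
> (The 500m child heap also looks sized for the hardware: a couple of child
> JVMs per node at 500MB each, plus the datanode and tasktracker daemons,
> still fit comfortably within each machine's 2GB of RAM.)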
>
> My ~/src/nutch/conf/crawl-urlfilter.txt contains:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept everything else
> +.
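>
> (As a concrete example of the loop-breaking rule: a URL like
> http://example.com/foo/bar/foo/baz/foo/qux/ repeats the segment /foo/
> three times and is rejected, while http://example.com/foo/bar/foo/baz/
> repeats it only twice, matches no other rule, and is accepted by the
> final "+." line.)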
>
> To run the crawl I gave the following commands on the master host:
>
> # checkout nutch sources and build them
> mkdir ~/src
> cd ~/src
> ~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> cd nutch
> ~/local/ant/bin/ant
>
> # install config files named above in ~/src/nutch/conf
>
> # create dmoz/urls file
> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
> gunzip content.rdf.u8.gz
> mkdir dmoz
> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
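>
> (The full DMOZ dump expands to a few million URLs. If I remember right,
> DmozParser also takes a -subset option, e.g. "-subset 5000" to keep
> roughly one URL in every 5000, which makes for a much smaller test crawl.)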
>
> # create required directories on slaves
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
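>
> (bin/slaves.sh simply runs its arguments over ssh on every host listed in
> the NUTCH_SLAVES file, so something like "bin/slaves.sh uptime" is a quick
> way to confirm that passwordless ssh and the per-host environment are set
> up before starting the daemons.)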
>
> # start nutch daemons
> bin/start-all.sh
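>
> (start-all.sh brings up the NDFS namenode and the MapReduce jobtracker on
> this host, plus a datanode and tasktracker on each slave; bin/stop-all.sh
> shuts them all back down.)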
>
> # copy dmoz/urls into ndfs
> bin/nutch ndfs -put dmoz dmoz
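>
> (To confirm the copy worked, the NDFS shell can also list files; something
> like "bin/nutch ndfs -ls dmoz" should show the urls file now stored in
> NDFS.)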
>
> # crawl
> nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 < /dev/null >& crawl.log &
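>
> # optionally, follow progress in the log from another shell
> tail -f crawl.log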
>
> Then I visited http://master:50030/ to monitor progress.
>
> I think that's it!
>
> Doug
>
