Hi Doug,

Thanks a lot for taking the time to write such a detailed and informative reply. I just wanted to confirm: did you run this distributed crawl with Nutch version 0.7.1 or some other version? And was it a successful distributed crawl using MapReduce, or a workaround for distributed crawling?
Thanks and Regards,
Pushpesh

On 1/5/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Earl Cahill wrote:
> > Any chance you could walk through your implementation?
> > Like how the twenty boxes were assigned?  Maybe
> > upload your confs somewhere, and outline what commands
> > you actually ran?
>
> All 20 boxes are configured identically, running a Debian 2.4 kernel.
> These are dual-processor boxes with 2GB of RAM each.  Each machine has
> four drives, mounted as a RAID on /export/crawlspace.  This cluster uses
> NFS to mount home directories, so I did not have to set NUTCH_MASTER in
> order to rsync copies of nutch to all machines.
>
> I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and subversion
> in ~/local/svn.
>
> My ~/.ssh/environment contains:
>
> JAVA_HOME=/home/dcutting/local/java
> NUTCH_OPTS=-server
> NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
> NUTCH_SLAVES=/home/dcutting/.slaves
>
> I added the following to ~/.bash_profile, then logged out & back in.
>
> export `cat ~/.ssh/environment`
>
> I added the following to /etc/ssh/sshd_config on all hosts:
>
> PermitUserEnvironment yes
>
> My ~/.slaves file contains a list of all 20 slave hosts, one per line.
>
> My ~/src/nutch/conf/mapred-default.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>1000</value>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>39</value>
> </property>
>
> </nutch-conf>
>
> My ~/src/nutch/conf/nutch-site.xml contains:
>
> <nutch-conf>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>generate.max.per.host</name>
>   <value>100</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
> </property>
>
> <property>
>   <name>parser.html.impl</name>
>   <value>tagsoup</value>
> </property>
>
> <!-- NDFS -->
>
> <property>
>   <name>fs.default.name</name>
>   <value>adminhost:8009</value>
> </property>
>
> <property>
>   <name>ndfs.name.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
> </property>
>
> <property>
>   <name>ndfs.data.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/ndfs</value>
> </property>
>
> <!-- MapReduce -->
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>adminhost:8010</value>
> </property>
>
> <property>
>   <name>mapred.system.dir</name>
>   <value>/mapred/system</value>
> </property>
>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/export/crawlspace/tmp/dcutting/local</value>
> </property>
>
> <property>
>   <name>mapred.child.heap.size</name>
>   <value>500m</value>
> </property>
>
> </nutch-conf>
>
> My ~/src/nutch/conf/crawl-urlfilter.txt contains:
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept everything else
> +.
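(Side note, not part of Doug's message: 39 reduce tasks across 20 dual-processor slaves works out to roughly two per machine. Also, since the whole setup depends on the ~/.ssh/environment trick above, a quick sanity check is to run a non-interactive command over ssh to one slave; the hostname below is just a placeholder, and sshd has to be restarted after PermitUserEnvironment is added:

  # should echo the values from ~/.ssh/environment on the remote side
  ssh some-slave 'echo $JAVA_HOME $NUTCH_LOG_DIR $NUTCH_SLAVES'

If these come back empty, the daemons started over ssh by bin/start-all.sh would presumably not find Java either.)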
>
> To run the crawl I gave the following commands on the master host:
>
> # checkout nutch sources and build them
> mkdir ~/src
> cd ~/src
> ~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> cd nutch
> ~/local/ant/bin/ant
>
> # install config files named above in ~/src/nutch/conf
>
> # create dmoz/urls file
> wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
> gunzip content.rdf.u8.gz
> mkdir dmoz
> bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8.gz > dmoz/urls
>
> # create required directories on slaves
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
> bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
>
> # start nutch daemons
> bin/start-all.sh
>
> # copy dmoz/urls into ndfs
> bin/nutch ndfs -put dmoz dmoz
>
> # crawl
> nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 16000000 < /dev/null >& crawl.log &
>
> Then I visited http://master:50030/ to monitor progress.
>
> I think that's it!
>
> Doug
>
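(One more note for anyone reproducing this, again not from Doug's message: besides the web UI at http://master:50030/, progress can be followed from the master roughly like this:

  # the nohup redirection above sends all crawl output to crawl.log
  tail -f crawl.log

  # list the crawl output in NDFS as it is produced, assuming the ndfs
  # shell supports -ls alongside the -put used above
  bin/nutch ndfs -ls crawl

This is only a sketch of how one might keep an eye on the job.)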
