Hi,
I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
best-suited network topology for Internet crawling (the frontend being a
network bottleneck), but I think it's fine for testing purposes.
I'm having issues with the fetch MapReduce job:
According to Ganglia monitoring (network traffic) and the Hadoop
administrative interfaces, the fetch phase is only being executed on the
frontend node, where I launched "nutch crawl" (the same status numbers can
also be pulled from the command line; see the sketch after the job list).
The previous Nutch phases were executed neatly distributed across all nodes:
job_200807131223_0001  hadoop  inject urls                                              maps: 2/2 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               maps: 3/3 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     maps: 3/3 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  maps: 4/4 (100.00%)  reduces: 2/2 (100.00%)
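(For reference, this is roughly how I pull the same completion numbers from
the command line instead of the web UI; a minimal sketch, assuming it is run
from the hadoop bin/ directory with one of the job ids above:)

# Ask the JobTracker for the map/reduce completion state of a job
./hadoop job -status job_200807131223_0004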
I've checked that:
1) The nodes have Internet connectivity and correct firewall settings
2) There's enough space on the local disks
3) The proper processes are running on all nodes
frontend-node:
==========
[EMAIL PROTECTED] ~]# jps
29232 NameNode
29489 DataNode
29860 JobTracker
29778 SecondaryNameNode
31122 Crawl
30137 TaskTracker
10989 Jps
1818 TaskTracker$Child
leaf nodes:
========
[EMAIL PROTECTED] ~]# cluster-fork jps
compute-0-1:
23929 Jps
15568 TaskTracker
15361 DataNode
compute-0-2:
32272 TaskTracker
32065 DataNode
7197 Jps
2397 TaskTracker$Child
compute-0-3:
12054 DataNode
19584 Jps
14824 TaskTracker$Child
12261 TaskTracker
4) The logs show the fetching process taking place only on the head node
(a quick per-node check is sketched right after the excerpt):
2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching http://valleycycles.net/
2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
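(This is the kind of per-node check I mean; sketch only, adjust the
hadoop.log path to wherever your install writes it:)

# Count fetcher log lines on every leaf node via Rocks' cluster-fork;
# a node running fetch map tasks should report a non-zero count.
cluster-fork "grep -c 'fetcher.Fetcher' /home/hadoop/nutch/logs/hadoop.log"
# Same check on the frontend for comparison
grep -c 'fetcher.Fetcher' /home/hadoop/nutch/logs/hadoop.log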
What am I missing? Why are there no fetching instances on the leaf nodes?
I used the following custom script to launch a pristine crawl each time
(how I invoke it is shown right after the script):
#!/bin/sh
# 1) Leaves HDFS safe mode
# 2) Removes the previous seed url list and crawled content from HDFS
# 3) Clears old logs and uploads a fresh seed url list
# 4) Performs a clean crawl
#export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JAVA_HOME=/usr/java/jdk1.5.0_10
CRAWL_DIR=${1:-crawl-ecxi}
URL_DIR=${2:-urls}
echo $CRAWL_DIR
echo $URL_DIR
echo "Leaving safe mode..."
./hadoop dfsadmin -safemode leave
echo "Removing seed urls directory and previous crawled content..."
./hadoop dfs -rmr $URL_DIR
./hadoop dfs -rmr $CRAWL_DIR
echo "Removing past logs"
rm -rf ../logs/*
echo "Uploading seed urls..."
./hadoop dfs -put ../$URL_DIR $URL_DIR
#echo "Entering safe mode..."
#./hadoop dfsadmin -safemode enter
echo "******************"
echo "* STARTING CRAWL *"
echo "******************"
./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
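(For completeness, this is how I invoke it; the script name is just what I
call it locally, and both arguments are optional, defaulting to the values
above:)

# run from the nutch bin/ directory
./clean_crawl.sh crawl-ecxi urls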
The next step I'm considering to fix the problem is to install
nutch+hadoop as described in this past nutch-user mail:
http://www.mail-archive.com/[email protected]/msg10225.html
Since I don't know whether that is still current practice on trunk (the
archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's another
way to fix it, or if someone is already working on it... I haven't found a
matching bug in JIRA :_/