Hi,
I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
best-suited network topology for Internet crawling (the frontend being a
network bottleneck), but I think it's fine for testing purposes.
I'm having issues with the fetch MapReduce job:
According to Ganglia monitoring (network traffic) and the Hadoop
administrative interfaces, the fetch phase is only being executed on the
frontend node, where I launched "nutch crawl" (the same status numbers can
also be pulled from the command line; see the sketch after the job list).
The previous Nutch phases were executed neatly distributed across all nodes:
job_200807131223_0001  hadoop  inject urls                                              maps: 2/2 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               maps: 3/3 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     maps: 3/3 (100.00%)  reduces: 1/1 (100.00%)
job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  maps: 4/4 (100.00%)  reduces: 2/2 (100.00%)
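(For reference, this is roughly how I pull the same completion numbers from
the command line instead of the web UI; a minimal sketch, assuming it is run
from the hadoop bin/ directory with one of the job ids above:)

# Ask the JobTracker for the map/reduce completion state of a job
./hadoop job -status job_200807131223_0004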
I've checked that:
1) The nodes have Internet connectivity and correct firewall settings
2) There's enough space on the local disks
3) The proper processes are running on all nodes
frontend-node:
==========
[EMAIL PROTECTED] ~]# jps
29232 NameNode
29489 DataNode
29860 JobTracker
29778 SecondaryNameNode
31122 Crawl
30137 TaskTracker
10989 Jps
1818 TaskTracker$Child
leaf nodes:
========
[EMAIL PROTECTED] ~]# cluster-fork jps
compute-0-1:
23929 Jps
15568 TaskTracker
15361 DataNode
compute-0-2:
32272 TaskTracker
32065 DataNode
7197 Jps
2397 TaskTracker$Child
compute-0-3:
12054 DataNode
19584 Jps
14824 TaskTracker$Child
12261 TaskTracker
4) The logs show the fetching process taking place only on the head node
(a quick per-node check is sketched right after the excerpt):
2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching http://valleycycles.net/
2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
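(This is the kind of per-node check I mean; sketch only, adjust the
hadoop.log path to wherever your install writes it:)

# Count fetcher log lines on every leaf node via Rocks' cluster-fork;
# a node running fetch map tasks should report a non-zero count.
cluster-fork "grep -c 'fetcher.Fetcher' /home/hadoop/nutch/logs/hadoop.log"
# Same check on the frontend for comparison
grep -c 'fetcher.Fetcher' /home/hadoop/nutch/logs/hadoop.log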
What am I missing? Why are there no fetching instances on the leaf nodes?
I used the following custom script to launch a pristine crawl each time
(how I invoke it is shown right after the script):
#!/bin/sh
# 1) Leaves HDFS safe mode
# 2) Removes the previous seed url list and crawled content from HDFS
# 3) Clears old logs and uploads a fresh seed url list
# 4) Performs a clean crawl
#export JAVA_HOME=/usr/lib/jvm/java-6-sun
export JAVA_HOME=/usr/java/jdk1.5.0_10
CRAWL_DIR=${1:-crawl-ecxi}
URL_DIR=${2:-urls}
echo $CRAWL_DIR
echo $URL_DIR
echo "Leaving safe mode..."
./hadoop dfsadmin -safemode leave
echo "Removing seed urls directory and previous crawled content..."
./hadoop dfs -rmr $URL_DIR
./hadoop dfs -rmr $CRAWL_DIR
echo "Removing past logs"
rm -rf ../logs/*
echo "Uploading seed urls..."
./hadoop dfs -put ../$URL_DIR $URL_DIR
#echo "Entering safe mode..."
#./hadoop dfsadmin -safemode enter
echo "******************"
echo "* STARTING CRAWL *"
echo "******************"
./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
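(For completeness, this is how I invoke it; the script name is just what I
call it locally, and both arguments are optional, defaulting to the values
above:)

# run from the nutch bin/ directory
./clean_crawl.sh crawl-ecxi urls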
The next step I'm considering to fix the problem is to install
nutch+hadoop as described in this past nutch-user mail:
http://www.mail-archive.com/[email protected]/msg10225.html
Since I don't know whether that is still current practice on trunk (the
archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's another
way to fix it, or if someone is already working on it... I haven't found a
matching bug in JIRA :_/