Boiling the problem down, this is where I'm stuck:

2008-07-14 16:43:24,976 WARN  dfs.DataNode -
192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
192.168.0.252:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
        at java.lang.Thread.run(Thread.java:595)

I checked that the firewall settings between the nodes and the frontend
were not blocking packets, and they aren't... does anyone know why this
happens? If not, could you suggest a convenient way to debug it?
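
For reference, this is roughly how I checked the firewall side (hosts and
ports are the ones from the log above; the commands are only a sketch):

nc -v -z 192.168.0.252 50010   # frontend -> leaf node DataNode port
nc -v -z 192.168.0.100 50010   # and the reverse, from the leaf node
iptables -L -n | grep 50010    # no DROP/REJECT rules for 50010 on either box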

Thanks!

On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
> best-suited network topology for inet crawling (the frontend being a
> network bottleneck), but I think it's fine for testing purposes.
>
> I'm having issues with the fetch MapReduce job:
>
> According to Ganglia monitoring (network traffic) and the Hadoop
> administrative interfaces, the fetch phase is only being executed on the
> frontend node, where I launched "nutch crawl". The previous Nutch phases
> were neatly distributed across all nodes (job list below, and a sketch of
> how to confirm task placement right after it):
>
> job_200807131223_0001  hadoop  inject urls
>     maps: 2/2 (100.00%)   reduces: 1/1 (100.00%)
> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb
>     maps: 3/3 (100.00%)   reduces: 1/1 (100.00%)
> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547
>     maps: 3/3 (100.00%)   reduces: 1/1 (100.00%)
> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547
>     maps: 4/4 (100.00%)   reduces: 2/2 (100.00%)
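>
> A convenient way to confirm where the fetch map tasks actually land is to
> grep every node's TaskTracker log for the job id (the job id below is only
> a placeholder for the fetch job, and the log path depends on the install):
>
> # take the real fetch job id from the JobTracker web UI (usually port 50030)
> cluster-fork 'grep -l job_200807131223_0005 /path/to/hadoop/logs/*tasktracker*.log'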
>
> I've checked that:
>
> 1) Nodes have inet connectivity and the firewall settings don't block it
>    (rough checks below)
> 2) There's enough space on the local discs
> 3) The proper processes are running on the nodes
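>
> For 1) and 2) the checks were roughly along these lines (the ping target
> is only an example):
>
> cluster-fork 'ping -c 1 www.apache.org'   # inet connectivity from every node
> cluster-fork 'df -h'                      # free space on the local discs
>
> For 3), jps on every machine: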
>
> frontend-node:
> ==========
>
> [EMAIL PROTECTED] ~]# jps
> 29232 NameNode
> 29489 DataNode
> 29860 JobTracker
> 29778 SecondaryNameNode
> 31122 Crawl
> 30137 TaskTracker
> 10989 Jps
> 1818 TaskTracker$Child
>
> leaf nodes:
> ========
>
> [EMAIL PROTECTED] ~]# cluster-fork jps
> compute-0-1:
> 23929 Jps
> 15568 TaskTracker
> 15361 DataNode
> compute-0-2:
> 32272 TaskTracker
> 32065 DataNode
> 7197 Jps
> 2397 TaskTracker$Child
> compute-0-3:
> 12054 DataNode
> 19584 Jps
> 14824 TaskTracker$Child
> 12261 TaskTracker
>
> 4) Logs only show the fetching process (taking place only on the head node):
>
> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
> http://valleycycles.net/
> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> robots.txt for http://www.getting-forward.org/:
> java.net.UnknownHostException: www.getting-forward.org
> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> robots.txt for http://www.getting-forward.org/:
> java.net.UnknownHostException: www.getting-forward.org
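>
> (Side note: given the UnknownHostException above, it's probably worth also
> checking name resolution from the leaf nodes themselves, e.g.:
>
> cluster-fork 'getent hosts www.getting-forward.org'
>
> since they only reach the net through the frontend's NAT.)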
>
> What am I missing? Why are there no fetching instances on the leaf nodes?
> I used the following custom script to launch a pristine crawl each time:
>
> #!/bin/sh
>
> # 1) Stops hadoop daemons
> # 2) Overwrites new url list on HDFS
> # 3) Starts hadoop daemons
> # 4) Performs a clean crawl
>
> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> export JAVA_HOME=/usr/java/jdk1.5.0_10
>
> CRAWL_DIR=${1:-crawl-ecxi}   # first argument, defaults to crawl-ecxi
> URL_DIR=${2:-urls}           # second argument, defaults to urls
>
> echo $CRAWL_DIR
> echo $URL_DIR
>
> echo "Leaving safe mode..."
> ./hadoop dfsadmin -safemode leave
>
> echo "Removing seed urls directory and previous crawled content..."
> ./hadoop dfs -rmr $URL_DIR
> ./hadoop dfs -rmr $CRAWL_DIR
>
> echo "Removing past logs"
>
> rm -rf ../logs/*
>
> echo "Uploading seed urls..."
> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>
> #echo "Entering safe mode..."
> #./hadoop dfsadmin -safemode enter
>
> echo "******************"
> echo "* STARTING CRAWL *"
> echo "******************"
>
> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
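>
> With the defaults above the script runs with or without arguments (the
> file name is just whatever it's saved as, crawl.sh here):
>
> ./crawl.sh                         # uses crawl-ecxi and urls
> ./crawl.sh other-crawl other-urls  # override crawl dir and seed url dir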
>
>
> The next step I'm considering to fix the problem is to install
> nutch+hadoop as described in this past nutch-user mail:
>
> http://www.mail-archive.com/[email protected]/msg10225.html
>
> Since I don't know whether that is still current practice on trunk (the
> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
> another way to fix this, or whether it's already being worked on by
> someone... I haven't found a matching bug in JIRA :_/
>
