Hi brain,
        If I were you, I would download Wireshark
(http://www.wireshark.org/download.html) to see what is happening at the
network layer and whether that provides any clues.  A socket exception
that you don't expect is usually due to one side of the conversation not
understanding the other.  If you have 4 machines, then you have 4
possible places where default firewall rules could be causing an issue.
If it is not the firewall rules, the NAT rules are another potential
source of error, and even a router hardware fault could cause a problem.
        If you understand TCP, just make sure that you see the full TCP
conversation (handshake, data, clean teardown) happening in Wireshark.
If you don't understand Wireshark's display, let me know and I'll pass
on some quickstart information.
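
        For what it's worth, here's a rough sketch of the kind of
capture I mean (port 50010 comes from the DataNode log below; the
interface name eth0 is just an assumption):

# On each machine involved in the failing transfer, capture only
# DataNode traffic (assumes interface eth0 and the default port 50010).
tcpdump -i eth0 -w datanode.pcap 'tcp port 50010'
# Open datanode.pcap in Wireshark (display filter: tcp.port == 50010)
# and check which side sends the RST behind the "Connection reset".

Wireshark's "Follow TCP Stream" on one of those connections will show
you which side gives up first.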

        If you already know all of this, I don't have any way to help
you, as it looks like you're trying to accomplish something trickier
with nutch than I have ever attempted.

Patrick

-----Original Message-----
From: brainstorm [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, July 15, 2008 10:08 AM
To: [email protected]
Subject: Re: Distributed fetching only happening in one node ?

Boiling down the problem I'm stuck on this:

2008-07-14 16:43:24,976 WARN  dfs.DataNode -
192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
192.168.0.252:50010 got java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
        at java.lang.Thread.run(Thread.java:595)

I checked that the firewall settings between the node and the frontend
are not blocking packets, and they aren't... does anyone know why this
happens? If not, could you suggest a convenient way to debug it?
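
For reference, this is roughly the kind of check I did (just a sketch;
the addresses and port are the ones from the log above, and having
iptables/netcat on the machines is an assumption):

# Inspect the active firewall rules on both DataNodes.
iptables -L -n -v
# From the frontend (192.168.0.100), try the leaf node's DataNode port.
nc -vz 192.168.0.252 50010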

Thanks!

On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
> best-suited network topology for internet crawling (the frontend being
> a network bottleneck), but I think it's fine for testing purposes.
>
> I'm having issues with the fetch mapreduce job:
>
> According to Ganglia monitoring (network traffic) and the hadoop
> administrative interfaces, the fetch phase is only being executed on
> the frontend node, where I launched "nutch crawl". The previous nutch
> phases were neatly distributed across all nodes:
>
> (jobid / user / name / map % / maps / reduce % / reduces)
> job_200807131223_0001  hadoop  inject urls                                              100.00%  2/2  100.00%  1/1
> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               100.00%  3/3  100.00%  1/1
> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%  3/3  100.00%  1/1
> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%  4/4  100.00%  2/2
>
> I've checked that:
>
> 1) Nodes have inet connectivity and their firewall settings are OK
> 2) There's enough space on the local discs
> 3) The proper processes are running on the nodes
>
> frontend-node:
> ==========
>
> [EMAIL PROTECTED] ~]# jps
> 29232 NameNode
> 29489 DataNode
> 29860 JobTracker
> 29778 SecondaryNameNode
> 31122 Crawl
> 30137 TaskTracker
> 10989 Jps
> 1818 TaskTracker$Child
>
> leaf nodes:
> ========
>
> [EMAIL PROTECTED] ~]# cluster-fork jps
> compute-0-1:
> 23929 Jps
> 15568 TaskTracker
> 15361 DataNode
> compute-0-2:
> 32272 TaskTracker
> 32065 DataNode
> 7197 Jps
> 2397 TaskTracker$Child
> compute-0-3:
> 12054 DataNode
> 19584 Jps
> 14824 TaskTracker$Child
> 12261 TaskTracker
>
> 4) Logs only show the fetching process (taking place only on the head node):
>
> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
> http://valleycycles.net/
> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> robots.txt for http://www.getting-forward.org/:
> java.net.UnknownHostException: www.getting-forward.org
> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> robots.txt for http://www.getting-forward.org/:
> java.net.UnknownHostException: www.getting-forward.org
>
> What am I missing? Why are there no fetching instances on the nodes? I
> used the following custom script to launch a pristine crawl each time:
>
> #!/bin/sh
>
> # 1) Stops hadoop daemons
> # 2) Overwrites new url list on HDFS
> # 3) Starts hadoop daemons
> # 4) Performs a clean crawl
>
> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> export JAVA_HOME=/usr/java/jdk1.5.0_10
>
> # Default values; can be overridden by the first and second argument
> CRAWL_DIR=${1:-crawl-ecxi}
> URL_DIR=${2:-urls}
>
> echo $CRAWL_DIR
> echo $URL_DIR
>
> echo "Leaving safe mode..."
> ./hadoop dfsadmin -safemode leave
>
> echo "Removing seed urls directory and previous crawled content..."
> ./hadoop dfs -rmr $URL_DIR
> ./hadoop dfs -rmr $CRAWL_DIR
>
> echo "Removing past logs"
>
> rm -rf ../logs/*
>
> echo "Uploading seed urls..."
> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>
> #echo "Entering safe mode..."
> #./hadoop dfsadmin -safemode enter
>
> echo "******************"
> echo "* STARTING CRAWL *"
> echo "******************"
>
> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>
>
> The next step I'm considering to fix the problem is to install
> nutch+hadoop as specified in this past nutch-user mail:
>
> http://www.mail-archive.com/[email protected]/msg10225.html
>
> As I don't know whether that is still current practice on trunk (the
> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
> another way to fix this, or whether someone is already working on
> it... I haven't found a matching bug in JIRA :_/
>
