Yep, I know about wireshark, and wanted to avoid it while debugging this
issue (perhaps there was a simple solution/known bug/issue)...
I just launched wireshark on the frontend with the filter tcp.port == 50010,
and now I'm diving into the tcp stream... let's see if I see the light
(an RST flag somewhere?), thanks anyway for replying ;)
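
For the archives: the same thing can be watched from a plain terminal on
the frontend, assuming tcpdump is installed and eth0 is the interface
facing the nodes (just a sketch, adjust interface/port as needed):

  # show only DataNode traffic (port 50010) that carries an RST flag
  tcpdump -n -i eth0 'tcp port 50010 and (tcp[tcpflags] & tcp-rst != 0)'

Anything printed there means one of the two sides is actively tearing
the connection down.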
Just for the record, the phase that stalls is the fetcher during reduce:
job_200807151723_0005  hadoop  fetch crawl-ecxi/segments/20080715172458
    Maps:    100.00%  (2 of 2 completed)
    Reduces:  16.66%  (0 of 1 completed)
It's stuck at 16%, no traffic, no crawling, but still "running".
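
For what it's worth, the same progress numbers can also be polled from a
terminal (assuming the bin/hadoop wrapper is on the path, as in the
script further down), handy for watching whether the reduce ever moves
past 16%:

  # print map and reduce completion for the stuck job
  ./hadoop job -status job_200807151723_0005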
On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
<[EMAIL PROTECTED]> wrote:
> Hi brain,
> If I were you, I would download wireshark
> (http://www.wireshark.org/download.html) to see what is happening at the
> network layer and see if that provides any clues. A socket exception
> that you don't expect is usually due to one side of the conversation not
> understanding the other side. If you have 4 machines, then you have 4
> possible places where default firewall rules could be causing an issue.
> If it is not the firewall rules, the NAT rules could be a potential
> source of error. Also, even a router hardware error could cause a
> problem.
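> (A quick way to check that on each box, assuming the nodes use iptables
> and the DataNode port 50010 from the logs below, would be something
> like:
>
>   # list filter rules with packet counters, then the NAT table
>   iptables -L -n -v
>   iptables -t nat -L -n -v
>
> A DROP or REJECT rule whose counters keep growing while the job runs
> would point at the firewall.)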
> If you understand TCP, just make sure that you see all the
> correct TCP stuff happening in wireshark. If you don't understand
> wireshark's display, let me know, and I'll pass on some quickstart
> information.
>
> If you already know all of this, I don't have any way to help
> you, as it looks like you're trying to accomplish something trickier
> with nutch than I have ever attempted.
>
> Patrick
>
> -----Original Message-----
> From: brainstorm [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, July 15, 2008 10:08 AM
> To: [email protected]
> Subject: Re: Distributed fetching only happening in one node ?
>
> Boiling down the problem, I'm stuck on this:
>
> 2008-07-14 16:43:24,976 WARN dfs.DataNode -
> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>         at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>         at java.lang.Thread.run(Thread.java:595)
>
> Checked that the firewall settings between the node & the frontend were
> not blocking packets, and they aren't... does anyone know why this
> happens? If not, could you suggest a convenient way to debug it?
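>
> For reference, this is roughly how I checked the path to the DataNode
> port in both directions (assuming netcat is available; the addresses
> are the ones from the warning above):
>
>   # from the frontend (192.168.0.100) to the node's DataNode port
>   nc -vz 192.168.0.252 50010
>   # and back, from the node to the frontend
>   nc -vz 192.168.0.100 50010
>
> Both directions connect, which is what makes the reset so puzzling.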
>
> Thanks !
>
> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
>> best suited network topology for inet crawling (the frontend being a
>> network bottleneck), but I think it's fine for testing purposes.
>>
>> I'm having issues with the fetch mapreduce job:
>>
>> According to ganglia monitoring (network traffic) and the hadoop
>> administrative interfaces, the fetch phase is only being executed on
>> the frontend node, where I launched "nutch crawl". Previous nutch
>> phases were executed nicely distributed across all nodes:
>>
>> job_200807131223_0001  hadoop  inject urls
>>     Maps:    100.00%  (2 of 2 completed)
>>     Reduces: 100.00%  (1 of 1 completed)
>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb
>>     Maps:    100.00%  (3 of 3 completed)
>>     Reduces: 100.00%  (1 of 1 completed)
>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547
>>     Maps:    100.00%  (3 of 3 completed)
>>     Reduces: 100.00%  (1 of 1 completed)
>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547
>>     Maps:    100.00%  (4 of 4 completed)
>>     Reduces: 100.00%  (2 of 2 completed)
>>
>> I've checked that:
>>
>> 1) Nodes have inet connectivity and firewall settings aren't blocking
>> 2) There's enough space on the local disks
>> 3) The proper processes are running on the nodes (jps output below;
>> there's also an extra check sketched right after the log excerpt)
>>
>> frontend-node:
>> ==========
>>
>> [EMAIL PROTECTED] ~]# jps
>> 29232 NameNode
>> 29489 DataNode
>> 29860 JobTracker
>> 29778 SecondaryNameNode
>> 31122 Crawl
>> 30137 TaskTracker
>> 10989 Jps
>> 1818 TaskTracker$Child
>>
>> leaf nodes:
>> ========
>>
>> [EMAIL PROTECTED] ~]# cluster-fork jps
>> compute-0-1:
>> 23929 Jps
>> 15568 TaskTracker
>> 15361 DataNode
>> compute-0-2:
>> 32272 TaskTracker
>> 32065 DataNode
>> 7197 Jps
>> 2397 TaskTracker$Child
>> compute-0-3:
>> 12054 DataNode
>> 19584 Jps
>> 14824 TaskTracker$Child
>> 12261 TaskTracker
>>
>> 4) Logs only show the fetching process (taking place only on the head
>> node):
>>
>> 2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching
>> http://valleycycles.net/
>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
>> robots.txt for http://www.getting-forward.org/:
>> java.net.UnknownHostException: www.getting-forward.org
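>>
>> The extra check mentioned in 3) above, run from the frontend with the
>> same bin/hadoop wrapper, shows whether every DataNode actually
>> registered with the NameNode:
>>
>>   # should list 4 live DataNodes with their capacity and usage
>>   ./hadoop dfsadmin -report
>>
>> The JobTracker web UI on port 50030 of the frontend likewise shows how
>> many TaskTrackers are alive.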
>>
>> What am I missing? Why are there no fetching instances on the nodes? I
>> used the following custom script to launch a pristine crawl each time:
>>
>> #!/bin/sh
>>
>> # 1) Leaves HDFS safe mode
>> # 2) Removes the previous crawl output, seed urls and old logs
>> # 3) Uploads a fresh seed url list to HDFS
>> # 4) Performs a clean crawl
>>
>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>
>> # default values, overridable by the first/second command line argument
>> CRAWL_DIR=${1:-crawl-ecxi}
>> URL_DIR=${2:-urls}
>>
>> echo $CRAWL_DIR
>> echo $URL_DIR
>>
>> echo "Leaving safe mode..."
>> ./hadoop dfsadmin -safemode leave
>>
>> echo "Removing seed urls directory and previous crawled content..."
>> ./hadoop dfs -rmr $URL_DIR
>> ./hadoop dfs -rmr $CRAWL_DIR
>>
>> echo "Removing past logs"
>>
>> rm -rf ../logs/*
>>
>> echo "Uploading seed urls..."
>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>
>> #echo "Entering safe mode..."
>> #./hadoop dfsadmin -safemode enter
>>
>> echo "******************"
>> echo "* STARTING CRAWL *"
>> echo "******************"
>>
>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
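>>
>> With the argument handling above, the script can be run with the
>> defaults or with explicit directories, e.g. (assuming it's saved as
>> recrawl.sh inside nutch's bin/ directory; the name is arbitrary):
>>
>>   ./recrawl.sh                     # crawl-ecxi + urls
>>   ./recrawl.sh other-crawl seeds   # override both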
>>
>>
>> The next step I'm considering to fix the problem is to install
>> nutch+hadoop as described in this past nutch-user mail:
>>
>> http://www.mail-archive.com/[email protected]/msg10225.html
>>
>> As I don't know whether that's still current practice on trunk (the
>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
>> another way to fix it, or if it's already being worked on by
>> someone... I haven't found a matching bug in JIRA :_/
>>
>