While watching the DFS Wireshark trace (and the corresponding RSTs), the
crawl continued to the next step... it seems that this WARNING is actually
slowing down the whole crawling process: the previous fetch took 36
minutes to complete with a seed file of just 3 URLs :-!!!

I just posted a couple of exceptions/questions regarding DFS on hadoop
core mailing list.

PS: As a side note, the following error caught my attention:

Fetcher: starting
Fetcher: segment: crawl-ecxi/segments/20080715172458
Too many fetch-failures
task_200807151723_0005_m_000000_0: Fetcher: threads: 10
task_200807151723_0005_m_000000_0: fetching http://upc.es/
task_200807151723_0005_m_000000_0: fetching http://upc.edu/
task_200807151723_0005_m_000000_0: fetching http://upc.cat/
task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
with: org.apache.nutch.protocol.http.api.HttpException:
java.net.UnknownHostException: upc.cat

Unknown host?! Just try http://upc.cat in your browser: it *does*
exist, it just gets redirected to www.upc.cat :-/
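The UnknownHostException points at the resolver on the task node rather than at the site itself. A quick sanity check, as a sketch (assumes `getent` is available, as it is on typical Linux/Rocks nodes), that reads seed URLs on stdin and tries to resolve each host with the node's own resolver:

```shell
#!/bin/sh
# Sketch: check that every seed URL's host resolves using this node's
# resolver. Reads URLs on stdin, one per line.
while read -r url; do
  # strip the scheme and any path to get the bare hostname
  host=$(echo "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')
  if getent hosts "$host" > /dev/null 2>&1; then
    echo "OK $host"
  else
    echo "FAIL $host"
  fi
done
```

Running it once on the frontend and again on each compute node would show whether upc.cat resolves differently per node; the NAT'ed leaf nodes may simply have no working DNS.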

On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Yep, I know about wireshark, and wanted to avoid it to debug this
> issue (perhaps there was a simple solution/known bug/issue)...
>
> I just launched wireshark on frontend with filter tcp.port == 50010,
> and now I'm diving on the tcp stream... let's see if I see the light
> (RST flag somewhere ?), thanks anyway for replying ;)
>
> Just for the record, the phase that stalls is fetcher during reduce:
>
> Jobid                  User    Name                                      Map %     Map    Maps       Reduce %  Reduce  Reduces
>                                                                          Complete  Total  Completed  Complete  Total   Completed
> job_200807151723_0005  hadoop  fetch crawl-ecxi/segments/20080715172458  100.00%   2      2          16.66%    1       0
>
> It's stuck at 16%: no traffic, no crawling, but still "running".
>
> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> <[EMAIL PROTECTED]> wrote:
>> Hi brain,
>>        If I were you, I would download wireshark
>> (http://www.wireshark.org/download.html) to see what is happening at the
>> network layer and see if that provides any clues.  A socket exception
>> that you don't expect is usually due to one side of the conversation not
>> understanding the other side.  If you have 4 machines, then you have 4
>> possible places where default firewall rules could be causing an issue.
>> If it is not the firewall rules, the NAT rules could be a potential
>> source of error.  Also, even a router hardware error could cause a
>> problem.
>>        If you understand TCP, just make sure that you see all the
>> correct TCP stuff happening in wireshark.  If you don't understand
>> wireshark's display, let me know, and I'll pass on some quickstart
>> information.
>>
>>        If you already know all of this, I don't have any way to help
>> you, as it looks like you're trying to accomplish something trickier
>> with nutch than I have ever attempted.
>>
>> Patrick
>>
>> -----Original Message-----
>> From: brainstorm [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, July 15, 2008 10:08 AM
>> To: [email protected]
>> Subject: Re: Distributed fetching only happening in one node ?
>>
>> Boiling down the problem I'm stuck on this:
>>
>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>>        at
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>        at
>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>        at
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>        at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>        at
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>>        at
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>>        at
>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>>        at java.lang.Thread.run(Thread.java:595)
>>
>> I checked that the firewall settings between the node and the frontend
>> were not blocking packets, and they aren't... does anyone know why this
>> happens? If not, could you suggest a convenient way to debug it?
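A convenient first check is whether the DataNode port is reachable at all between the machines involved. A sketch with nc(1) (node addresses taken from the log above; assumes nc is installed):

```shell
#!/bin/sh
# Probe the DataNode port (50010) on each node. A CLOSED result with no
# firewall rule blocking it would point at the NAT setup instead.
for node in 192.168.0.100 192.168.0.252; do
  if nc -z -w 3 "$node" 50010 2>/dev/null; then
    echo "open   $node:50010"
  else
    echo "CLOSED $node:50010"
  fi
done
```

Running it from the frontend and from each leaf node covers both directions of the block transfer.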
>>
>> Thanks !
>>
>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
>>> best suited network topology for inet crawling (frontend being a net
>>> bottleneck), but I think it's fine for testing purposes.
>>>
>>> I'm having issues with fetch mapreduce job:
>>>
>>> According to ganglia monitoring (network traffic), and hadoop
>>> administrative interfaces, fetch phase is only being executed in the
>>> frontend node, where I launched "nutch crawl". Previous nutch phases
>>> were executed neatly distributed on all nodes:
>>>
>>> Jobid                  User    Name                                                    Map %     Map    Maps       Reduce %  Reduce  Reduces
>>>                                                                                        Complete  Total  Completed  Complete  Total   Completed
>>> job_200807131223_0001  hadoop  inject urls                                             100.00%   2      2          100.00%   1       1
>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                              100.00%   3      3          100.00%   1       1
>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%   3      3          100.00%   1       1
>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%   4      4          100.00%   2       2
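One thing worth ruling out: the number of fetch map tasks follows from how many fetch lists `generate` produced, and with a tiny seed list that can collapse to a single task that lands on one node. A hadoop-site.xml sketch raising the default task counts (property names are the 0.1x-era ones, stated as an assumption; verify against the hadoop-default.xml of your version):

```xml
<!-- hadoop-site.xml sketch; property names assumed for Hadoop 0.1x -->
<property>
  <name>mapred.map.tasks</name>
  <value>6</value>    <!-- roughly 2x the number of task trackers -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>    <!-- one per leaf node -->
</property>
```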
>>>
>>> I've checked that:
>>>
>>> 1) Nodes have inet connectivity, firewall settings
>>> 2) There's enough space on local discs
>>> 3) Proper processes are running on nodes
>>>
>>> frontend-node:
>>> ==========
>>>
>>> [EMAIL PROTECTED] ~]# jps
>>> 29232 NameNode
>>> 29489 DataNode
>>> 29860 JobTracker
>>> 29778 SecondaryNameNode
>>> 31122 Crawl
>>> 30137 TaskTracker
>>> 10989 Jps
>>> 1818 TaskTracker$Child
>>>
>>> leaf nodes:
>>> ========
>>>
>>> [EMAIL PROTECTED] ~]# cluster-fork jps
>>> compute-0-1:
>>> 23929 Jps
>>> 15568 TaskTracker
>>> 15361 DataNode
>>> compute-0-2:
>>> 32272 TaskTracker
>>> 32065 DataNode
>>> 7197 Jps
>>> 2397 TaskTracker$Child
>>> compute-0-3:
>>> 12054 DataNode
>>> 19584 Jps
>>> 14824 TaskTracker$Child
>>> 12261 TaskTracker
>>>
>>> 4) Logs only show fetching process (taking place only in the head
>> node):
>>>
>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>>> http://valleycycles.net/
>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>> robots.txt for http://www.getting-forward.org/:
>>> java.net.UnknownHostException: www.getting-forward.org
>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>> robots.txt for http://www.getting-forward.org/:
>>> java.net.UnknownHostException: www.getting-forward.org
>>>
>>> What am I missing? Why are there no fetching instances on the nodes? I
>>> used the following custom script to launch a pristine crawl each time:
>>>
>>> #!/bin/sh
>>>
>>> # 1) Stops hadoop daemons
>>> # 2) Overwrites new url list on HDFS
>>> # 3) Starts hadoop daemons
>>> # 4) Performs a clean crawl
>>>
>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>>
>>> CRAWL_DIR=${1:-crawl-ecxi}
>>> URL_DIR=${2:-urls}
>>>
>>> echo $CRAWL_DIR
>>> echo $URL_DIR
>>>
>>> echo "Leaving safe mode..."
>>> ./hadoop dfsadmin -safemode leave
>>>
>>> echo "Removing seed urls directory and previous crawled content..."
>>> ./hadoop dfs -rmr $URL_DIR
>>> ./hadoop dfs -rmr $CRAWL_DIR
>>>
>>> echo "Removing past logs"
>>>
>>> rm -rf ../logs/*
>>>
>>> echo "Uploading seed urls..."
>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>>
>>> #echo "Entering safe mode..."
>>> #./hadoop dfsadmin -safemode enter
>>>
>>> echo "******************"
>>> echo "* STARTING CRAWL *"
>>> echo "******************"
>>>
>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
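As a side note on the script's argument handling: POSIX `${n:-default}` expansion is the standard way to give positional parameters fallback values (a plain `VAR=value || $n` does not do that). A minimal, standalone sketch:

```shell
#!/bin/sh
# ${1:-crawl-ecxi} expands to $1 when it is set and non-empty,
# and to the literal default otherwise.
CRAWL_DIR=${1:-crawl-ecxi}
URL_DIR=${2:-urls}
echo "$CRAWL_DIR $URL_DIR"
```

`sh demo.sh` prints `crawl-ecxi urls`; `sh demo.sh foo bar` prints `foo bar`.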
>>>
>>>
>>> Next step I'm thinking on to fix the problem is to install
>>> nutch+hadoop as specified in this past nutch-user mail:
>>>
>>> http://www.mail-archive.com/[email protected]/msg10225.html
>>>
>>> Since I don't know whether that setup is still current on trunk (the
>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
>>> another way to fix it, or if someone is already working on it... I
>>> haven't found a matching bug in JIRA :_/
>>>
>>
>
