OK, the DFS warnings problem is solved; it seems the hadoop-0.17.1 patch fixes
the warnings... BUT, on a 7-node Nutch cluster:

1) Fetching only happens on *one* node, despite having tested several
values for these settings:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
export HADOOP_HEAPSIZE

I've played with mapreduce (hadoop-site.xml) settings as advised on:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces
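
For reference, the kind of thing I've been putting in hadoop-site.xml looks
roughly like this (the values are just examples of what I tried, not a
recommendation; as far as I understand, the *.maximum properties are
per-tasktracker limits, while mapred.map.tasks is only a hint to the
framework):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value> <!-- max simultaneous map tasks per tasktracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- max simultaneous reduce tasks per tasktracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>7</value> <!-- a hint only; the actual maps depend on the input splits -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
</property>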

But Nutch keeps crawling using only one node instead of seven... does
anybody know why?

I've had a look at the code, searching for:

conf.setNumMapTasks(int num), but found no calls to it, so I guess the
number of mappers & reducers is not limited programmatically.
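
If it did have to be forced programmatically, I imagine it would look
something like the sketch below (just my sketch against the Hadoop 0.17
JobConf API, not code that exists in Nutch; as I understand it,
setNumMapTasks() is only a hint while setNumReduceTasks() is honoured, so
for the fetch job the real map count seems to come from how many fetch-list
parts the generate step produced):

// Hypothetical sketch only: this class does not exist in Nutch.
import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void configure(JobConf job) {
    // A hint only: the actual number of maps is driven by the input splits
    // (for the fetcher, the fetch-list parts written by the generate step).
    job.setNumMapTasks(7);
    // This one is honoured by the framework.
    job.setNumReduceTasks(7);
  }
}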

2) Even on a single node, fetching is really slow: 1 URL or page per
second, at most.

Can anybody shed some light on this? Pointing out which class/code I
should look into to modify this behaviour would also help.
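
In case it helps others, the class doing the work seems to be
org.apache.nutch.fetcher.Fetcher, and the knobs I've found so far for fetch
speed are the ones below in nutch-site.xml (the values shown are the
defaults as I read them in nutch-default.xml, so treat them as approximate;
with a per-host delay and only a handful of hosts in my seed list,
~1 page/second may simply be the politeness settings at work):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value> <!-- total fetcher threads per fetch task -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value> <!-- how many threads may hit the same host at once -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value> <!-- seconds to wait between requests to the same host -->
</property>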

Does anybody have a distributed Nutch crawling cluster working with all
nodes fetching during the fetch phase?

I even ran some numbers with the wordcount example, using 7 nodes at 100%
CPU usage on a 425 MB parsed-text file:

maps   reduces   heapsize (MB)   time
2      2         500             3m43.049s
4      4         500             4m41.846s
8      8         500             4m29.344s
16     16        500             3m43.672s
32     32        500             3m41.367s
64     64        500             4m27.275s
128    128       500             4m35.233s
256    256       500             3m41.916s

2      2         2000            4m31.434s
4      4         2000
8      8         2000
16     16        2000            4m32.213s
32     32        2000
64     64        2000
128    128       2000
256    256       2000            4m38.310s
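
For completeness, the heapsize column above is the value I exported in
conf/hadoop-env.sh before restarting the daemons, i.e. something like
(value in MB, if I read the comments in hadoop-env.sh correctly):

export HADOOP_HEAPSIZE=2000   # maximum heap, in MB, for the Hadoop daemons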

Thanks in advance,
Roman

On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> While watching the DFS wireshark trace (and the corresponding RSTs), the
> crawl continued to the next step... it seems that this WARNING is actually
> slowing down the whole crawling process (it took 36 minutes to complete
> the previous fetch) with a seed file of just 3 URLs :-!!!
>
> I just posted a couple of exceptions/questions regarding DFS to the
> hadoop core mailing list.
>
> PS: As a side note, the following error caught my attention:
>
> Fetcher: starting
> Fetcher: segment: crawl-ecxi/segments/20080715172458
> Too many fetch-failures
> task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> task_200807151723_0005_m_000000_0: fetching http://upc.es/
> task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> with: org.apache.nutch.protocol.http.api.HttpException:
> java.net.UnknownHostException: upc.cat
>
> Unknown host? Just try "http://upc.cat" in your browser: it *does*
> exist, it just gets redirected to www.upc.cat :-/
>
> On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> Yep, I know about wireshark, and wanted to avoid it to debug this
>> issue (perhaps there was a simple solution/known bug/issue)...
>>
>> I just launched wireshark on the frontend with filter tcp.port == 50010,
>> and now I'm diving into the TCP stream... let's see if I see the light
>> (an RST flag somewhere?), thanks anyway for replying ;)
>>
>> Just for the record, the phase that stalls is fetcher during reduce:
>>
>> Jobid: job_200807151723_0005   User: hadoop   Name: fetch crawl-ecxi/segments/20080715172458
>> Map: 100.00% complete (2 total, 2 completed)
>> Reduce: 16.66% complete (1 total, 0 completed)
>>
>> It's stuck on 16%, no traffic, no crawling, but still "running".
>>
>> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> <[EMAIL PROTECTED]> wrote:
>>> Hi brain,
>>>        If I were you, I would download wireshark
>>> (http://www.wireshark.org/download.html) to see what is happening at the
>>> network layer and see if that provides any clues.  A socket exception
>>> that you don't expect is usually due to one side of the conversation not
>>> understanding the other side.  If you have 4 machines, then you have 4
>>> possible places where default firewall rules could be causing an issue.
>>> If it is not the firewall rules, the NAT rules could be a potential
>>> source of error.  Also, even a router hardware error could cause a
>>> problem.
>>>        If you understand TCP, just make sure that you see all the
>>> correct TCP stuff happening in wireshark.  If you don't understand
>>> wireshark's display, let me know, and I'll pass on some quickstart
>>> information.
>>>
>>>        If you already know all of this, I don't have any way to help
>>> you, as it looks like you're trying to accomplish something trickier
>>> with nutch than I have ever attempted.
>>>
>>> Patrick
>>>
>>> -----Original Message-----
>>> From: brainstorm [mailto:[EMAIL PROTECTED]
>>> Sent: Tuesday, July 15, 2008 10:08 AM
>>> To: [email protected]
>>> Subject: Re: Distributed fetching only happening in one node ?
>>>
>>> Boiling down the problem, I'm stuck on this:
>>>
>>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode - 192.168.0.100:50010:Failed
>>> to transfer blk_-855404545666908011 to 192.168.0.252:50010 got
>>> java.net.SocketException: Connection reset
>>>        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>>        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>>>        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>>>        at java.lang.Thread.run(Thread.java:595)
>>>
>>> I checked that the firewall settings between the node & the frontend are
>>> not blocking packets, and they aren't... does anyone know why this
>>> happens? If not, could you suggest a convenient way to debug it?
>>>
>>> Thanks !
>>>
>>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
>>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
>>>> best-suited network topology for internet crawling (the frontend being a
>>>> network bottleneck), but I think it's fine for testing purposes.
>>>>
>>>> I'm having issues with fetch mapreduce job:
>>>>
>>>> According to Ganglia monitoring (network traffic) and the Hadoop
>>>> administrative interfaces, the fetch phase is only being executed on the
>>>> frontend node, where I launched "nutch crawl". The previous Nutch phases
>>>> were executed nicely distributed across all nodes:
>>>>
>>>> job_200807131223_0001   hadoop  inject urls
>>>>     Map: 100.00% (2 total, 2 completed)   Reduce: 100.00% (1 total, 1 completed)
>>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
>>>>     Map: 100.00% (3 total, 3 completed)   Reduce: 100.00% (1 total, 1 completed)
>>>> job_200807131223_0003   hadoop  generate: select crawl-ecxi/segments/20080713123547
>>>>     Map: 100.00% (3 total, 3 completed)   Reduce: 100.00% (1 total, 1 completed)
>>>> job_200807131223_0004   hadoop  generate: partition crawl-ecxi/segments/20080713123547
>>>>     Map: 100.00% (4 total, 4 completed)   Reduce: 100.00% (2 total, 2 completed)
>>>>
>>>> I've checked that:
>>>>
>>>> 1) Nodes have internet connectivity and permissive firewall settings
>>>> 2) There's enough space on local discs
>>>> 3) Proper processes are running on nodes
>>>>
>>>> frontend-node:
>>>> ==========
>>>>
>>>> [EMAIL PROTECTED] ~]# jps
>>>> 29232 NameNode
>>>> 29489 DataNode
>>>> 29860 JobTracker
>>>> 29778 SecondaryNameNode
>>>> 31122 Crawl
>>>> 30137 TaskTracker
>>>> 10989 Jps
>>>> 1818 TaskTracker$Child
>>>>
>>>> leaf nodes:
>>>> ========
>>>>
>>>> [EMAIL PROTECTED] ~]# cluster-fork jps
>>>> compute-0-1:
>>>> 23929 Jps
>>>> 15568 TaskTracker
>>>> 15361 DataNode
>>>> compute-0-2:
>>>> 32272 TaskTracker
>>>> 32065 DataNode
>>>> 7197 Jps
>>>> 2397 TaskTracker$Child
>>>> compute-0-3:
>>>> 12054 DataNode
>>>> 19584 Jps
>>>> 14824 TaskTracker$Child
>>>> 12261 TaskTracker
>>>>
>>>> 4) Logs only show the fetching process (taking place only on the head node):
>>>>
>>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>>>> http://valleycycles.net/
>>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>>> robots.txt for http://www.getting-forward.org/:
>>>> java.net.UnknownHostException: www.getting-forward.org
>>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>>> robots.txt for http://www.getting-forward.org/:
>>>> java.net.UnknownHostException: www.getting-forward.org
>>>>
>>>> What am I missing? Why are there no fetching instances on the nodes? I
>>>> used the following custom script to launch a pristine crawl each time:
>>>>
>>>> #!/bin/sh
>>>>
>>>> # 1) Stops hadoop daemons
>>>> # 2) Overwrites new url list on HDFS
>>>> # 3) Starts hadoop daemons
>>>> # 4) Performs a clean crawl
>>>>
>>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>>>
>>>> # Use the positional arguments if given, otherwise fall back to defaults
>>>> CRAWL_DIR=${1:-crawl-ecxi}
>>>> URL_DIR=${2:-urls}
>>>>
>>>> echo $CRAWL_DIR
>>>> echo $URL_DIR
>>>>
>>>> echo "Leaving safe mode..."
>>>> ./hadoop dfsadmin -safemode leave
>>>>
>>>> echo "Removing seed urls directory and previous crawled content..."
>>>> ./hadoop dfs -rmr $URL_DIR
>>>> ./hadoop dfs -rmr $CRAWL_DIR
>>>>
>>>> echo "Removing past logs"
>>>>
>>>> rm -rf ../logs/*
>>>>
>>>> echo "Uploading seed urls..."
>>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>>>
>>>> #echo "Entering safe mode..."
>>>> #./hadoop dfsadmin -safemode enter
>>>>
>>>> echo "******************"
>>>> echo "* STARTING CRAWL *"
>>>> echo "******************"
>>>>
>>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>>>>
>>>>
>>>> The next step I'm thinking of to fix the problem is to install
>>>> nutch+hadoop as specified in this past nutch-user mail:
>>>>
>>>> http://www.mail-archive.com/[email protected]/msg10225.html
>>>>
>>>> As I don't know whether that's still current practice on trunk (the
>>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
>>>> another way to fix it, or if someone is already working on it... I
>>>> haven't found a matching bug in JIRA :-/
>>>>
>>>
>>
>
