Hi,

1. You should set the mapred.map.tasks and mapred.reduce.tasks parameters; they default to 2 and 1.

2. You can also set the number of threads the fetcher uses. In addition, there is a parameter that slows down fetching from any one host (so-called polite fetching, so you don't DoS the site). So check your configuration; a sketch of the relevant properties follows below.
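A minimal sketch of those properties, assuming a Hadoop 0.17-era hadoop-site.xml and the usual Nutch fetcher properties in nutch-site.xml; the values shown are illustrative, not recommendations:

  <!-- hadoop-site.xml, inside <configuration>: tasks requested per job -->
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>

  <!-- nutch-site.xml, inside <configuration>: fetcher parallelism and politeness -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <!-- seconds to wait between requests to the same server -->
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>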
Alex

2008/8/5 brainstorm <[EMAIL PROTECTED]>

> Ok, DFS warnings problem solved; it seems that the hadoop-0.17.1 patch fixes
> the warnings... BUT, on a 7-node nutch cluster:
>
> 1) Fetching is only happening on *one* node despite several values
> tested for these settings:
> mapred.tasktracker.map.tasks.maximum
> mapred.tasktracker.reduce.tasks.maximum
> export HADOOP_HEAPSIZE
>
> I've played with the mapreduce (hadoop-site.xml) settings as advised on:
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> But nutch keeps crawling using only one node instead of seven
> nodes... does anybody know why?
>
> I've had a look at the code, searching for
> conf.setNumMapTasks(int num), but found none, so I guess that the
> number of mappers & reducers is not limited programmatically.
>
> 2) Even on a single node, the fetching is really slow: 1 url or page
> per second, at most.
>
> Can anybody shed some light on this? Pointing out which class/code I
> should look into to modify this behaviour would also help.
>
> Does anybody have a distributed nutch crawling cluster working with all
> nodes fetching during the fetch phase?
>
> I even did some numbers with the wordcount example, using 7 nodes at 100%
> cpu usage on a 425MB parsedtext file:
>
> maps  reduces  heapsize  time
>    2        2       500  3m43.049s
>    4        4       500  4m41.846s
>    8        8       500  4m29.344s
>   16       16       500  3m43.672s
>   32       32       500  3m41.367s
>   64       64       500  4m27.275s
>  128      128       500  4m35.233s
>  256      256       500  3m41.916s
>
>    2        2      2000  4m31.434s
>    4        4      2000
>    8        8      2000
>   16       16      2000  4m32.213s
>   32       32      2000
>   64       64      2000
>  128      128      2000
>  256      256      2000  4m38.310s
>
> Thanks in advance,
> Roman
>
> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> > While watching the DFS wireshark trace (and the corresponding RSTs), the
> > crawl continued to the next step... it seems that this WARNING is actually
> > slowing down the whole crawling process (it took 36 minutes to
> > complete the previous fetch) with just a 3-url seed file :-!!!
> >
> > I just posted a couple of exceptions/questions regarding DFS on the hadoop
> > core mailing list.
> >
> > PS: As a side note, the following error caught my attention:
> >
> > Fetcher: starting
> > Fetcher: segment: crawl-ecxi/segments/20080715172458
> > Too many fetch-failures
> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> > with: org.apache.nutch.protocol.http.api.HttpException:
> > java.net.UnknownHostException: upc.cat
> >
> > Unknown host?¿ Just try "http://upc.cat" in your browser: it *does*
> > exist, it just gets redirected to www.upc.cat :-/
> >
> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> >> Yep, I know about wireshark, and wanted to avoid it while debugging this
> >> issue (in case there was a simple solution/known bug/issue)...
> >>
> >> I just launched wireshark on the frontend with the filter tcp.port == 50010,
> >> and now I'm diving into the tcp stream...
> >> let's see if I see the light (an RST flag somewhere?); thanks anyway for replying ;)
> >>
> >> Just for the record, the phase that stalls is the fetcher, during reduce:
> >>
> >> Jobid: job_200807151723_0005   User: hadoop   Name: fetch crawl-ecxi/segments/20080715172458
> >> Map % Complete: 100.00%   Map Total: 2   Maps Completed: 2
> >> Reduce % Complete: 16.66%   Reduce Total: 1   Reduces Completed: 0
> >>
> >> It's stuck at 16%, no traffic, no crawling, but still "running".
> >>
> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> >> <[EMAIL PROTECTED]> wrote:
> >>> Hi brain,
> >>>         If I were you, I would download wireshark
> >>> (http://www.wireshark.org/download.html) to see what is happening at the
> >>> network layer and see if that provides any clues. A socket exception
> >>> that you don't expect is usually due to one side of the conversation not
> >>> understanding the other side. If you have 4 machines, then you have 4
> >>> possible places where default firewall rules could be causing an issue.
> >>> If it is not the firewall rules, the NAT rules could be a potential
> >>> source of error. Also, even a router hardware error could cause a
> >>> problem.
> >>>         If you understand TCP, just make sure that you see all the
> >>> correct TCP stuff happening in wireshark. If you don't understand
> >>> wireshark's display, let me know, and I'll pass on some quickstart
> >>> information.
> >>>
> >>>         If you already know all of this, I don't have any way to help
> >>> you, as it looks like you're trying to accomplish something trickier
> >>> with nutch than I have ever attempted.
> >>>
> >>> Patrick
> >>>
> >>> -----Original Message-----
> >>> From: brainstorm [mailto:[EMAIL PROTECTED]
> >>> Sent: Tuesday, July 15, 2008 10:08 AM
> >>> To: [email protected]
> >>> Subject: Re: Distributed fetching only happening in one node ?
> >>>
> >>> Boiling down the problem, I'm stuck on this:
> >>>
> >>> 2008-07-14 16:43:24,976 WARN dfs.DataNode -
> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
> >>>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
> >>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >>>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> >>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
> >>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
> >>>         at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
> >>>         at java.lang.Thread.run(Thread.java:595)
> >>>
> >>> I checked that the firewall settings between the nodes and the frontend
> >>> were not blocking packets, and they aren't... does anyone know why this
> >>> happens? If not, could you suggest a convenient way to debug it?
> >>>
> >>> Thanks !
> >>>
> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> >>>> Hi,
> >>>>
> >>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine rocks
> >>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
> >>>> best suited network topology for inet crawling (the frontend being a
> >>>> net bottleneck), but I think it's fine for testing purposes.
> >>>>
> >>>> I'm having issues with the fetch mapreduce job:
> >>>>
> >>>> According to ganglia monitoring (network traffic) and the hadoop
> >>>> administrative interfaces, the fetch phase is only being executed on the
> >>>> frontend node, where I launched "nutch crawl". The previous nutch phases
> >>>> were executed neatly distributed across all nodes:
> >>>>
> >>>> (job, user, name, map %, maps total, maps completed, reduce %, reduces total, reduces completed)
> >>>> job_200807131223_0001  hadoop  inject urls                                              100.00%  2  2  100.00%  1  1
> >>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               100.00%  3  3  100.00%  1  1
> >>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%  3  3  100.00%  1  1
> >>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%  4  4  100.00%  2  2
> >>>>
> >>>> I've checked that:
> >>>>
> >>>> 1) Nodes have inet connectivity and their firewall settings are not blocking traffic
> >>>> 2) There's enough space on the local discs
> >>>> 3) The proper processes are running on the nodes
> >>>>
> >>>> frontend-node:
> >>>> ==========
> >>>>
> >>>> [EMAIL PROTECTED] ~]# jps
> >>>> 29232 NameNode
> >>>> 29489 DataNode
> >>>> 29860 JobTracker
> >>>> 29778 SecondaryNameNode
> >>>> 31122 Crawl
> >>>> 30137 TaskTracker
> >>>> 10989 Jps
> >>>> 1818 TaskTracker$Child
> >>>>
> >>>> leaf nodes:
> >>>> ========
> >>>>
> >>>> [EMAIL PROTECTED] ~]# cluster-fork jps
> >>>> compute-0-1:
> >>>> 23929 Jps
> >>>> 15568 TaskTracker
> >>>> 15361 DataNode
> >>>> compute-0-2:
> >>>> 32272 TaskTracker
> >>>> 32065 DataNode
> >>>> 7197 Jps
> >>>> 2397 TaskTracker$Child
> >>>> compute-0-3:
> >>>> 12054 DataNode
> >>>> 19584 Jps
> >>>> 14824 TaskTracker$Child
> >>>> 12261 TaskTracker
> >>>>
> >>>> 4) The logs only show the fetching process (taking place only on the head node):
> >>>>
> >>>> 2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching
> >>>> http://valleycycles.net/
> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>>
> >>>> What am I missing? Why are there no fetching instances on the nodes? I
> >>>> used the following custom script to launch a pristine crawl each time:
> >>>>
> >>>> #!/bin/sh
> >>>>
> >>>> # 1) Stops hadoop daemons
> >>>> # 2) Overwrites new url list on HDFS
> >>>> # 3) Starts hadoop daemons
> >>>> # 4) Performs a clean crawl
> >>>>
> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
> >>>>
> >>>> CRAWL_DIR=crawl-ecxi || $1
> >>>> URL_DIR=urls || $2
> >>>>
> >>>> echo $CRAWL_DIR
> >>>> echo $URL_DIR
> >>>>
> >>>> echo "Leaving safe mode..."
> >>>> ./hadoop dfsadmin -safemode leave
> >>>>
> >>>> echo "Removing seed urls directory and previous crawled content..."
> >>>> ./hadoop dfs -rmr $URL_DIR
> >>>> ./hadoop dfs -rmr $CRAWL_DIR
> >>>>
> >>>> echo "Removing past logs"
> >>>>
> >>>> rm -rf ../logs/*
> >>>>
> >>>> echo "Uploading seed urls..."
> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
> >>>>
> >>>> #echo "Entering safe mode..."
> >>>> #./hadoop dfsadmin -safemode enter
> >>>>
> >>>> echo "******************"
> >>>> echo "* STARTING CRAWL *"
> >>>> echo "******************"
> >>>>
> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
> >>>>
> >>>>
> >>>> The next step I'm considering, to fix the problem, is to install
> >>>> nutch+hadoop as specified in this past nutch-user mail:
> >>>>
> >>>> http://www.mail-archive.com/[email protected]/msg10225.html
> >>>>
> >>>> As I don't know whether that's still current practice on trunk (the
> >>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
> >>>> another way to fix it or if it's being worked on by someone... I
> >>>> haven't found a matching bug in JIRA :_/

--
Best Regards
Alexander Aristov
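A note on the crawl script quoted above: lines like "CRAWL_DIR=crawl-ecxi || $1" never fall back to the command-line argument, because the plain assignment always succeeds and the command after "||" is never run. If the intent was optional arguments with those defaults, a minimal sketch (keeping the script's own names and default values) would be:

  #!/bin/sh
  # Use the first/second argument when given, otherwise fall back to the defaults.
  CRAWL_DIR=${1:-crawl-ecxi}
  URL_DIR=${2:-urls}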
