Sure, I tried mapred.map.tasks and mapred.reduce.tasks with values 2 and
1 respectively *in the past*, with the same results. Right now I have 32
for both: same results, since those settings are just a hint for nutch.
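
For reference, this is roughly what those properties look like in my
hadoop-site.xml right now (just a sketch of my current test values, not a
recommendation; as said above they only act as a hint):

<property>
  <name>mapred.map.tasks</name>
  <value>32</value>
  <description>Hint for the number of map tasks per job.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>32</value>
  <description>Hint for the number of reduce tasks per job.</description>
</property>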

Regarding the number of threads *per host*, I tried 10 and 20 in the
past, same results.
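
For completeness, these are the fetcher thread knobs I've been touching (a
sketch of my nutch-site.xml; the values are just the ones from my tests, and
the fetcher log further down also shows "threads: 10"):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>Number of fetcher threads used by each fetch task.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
  <description>Maximum number of threads fetching from the same host
  at one time.</description>
</property>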

I appreciate your support, Alexander, thank you :)

On Tue, Aug 5, 2008 at 9:17 AM, Alexander Aristov
<[EMAIL PROTECTED]> wrote:
> Still not clear.
>
> What values for mapred.map.tasks and mapred.reduce.tasks do you have now?
> Check the hadoop-site.xml file as it may affect your configuration also.
>
> Alexander
>
> 2008/8/5 brainstorm <[EMAIL PROTECTED]>
>
>> Correction: Only 2 nodes doing map operation on fetch (nodes 7 and 2).
>>
>> On Tue, Aug 5, 2008 at 9:11 AM, brainstorm <[EMAIL PROTECTED]> wrote:
>> > Right, I've checked before with mapred.map.tasks to 2 and
>> > mapred.reduce.tasks to 1.
>> >
>> > I've also played with several values on the following settings:
>> >
>> > <property>
>> >  <name>fetcher.server.delay</name>
>> >  <value>1.5</value>
>> >  <description>The number of seconds the fetcher will delay between
>> >   successive requests to the same server.</description>
>> > </property>
>> >
>> > <property>
>> >  <name>http.max.delays</name>
>> >  <value>3</value>
>> >  <description>The number of times a thread will delay when trying to
>> >  fetch a page.  Each time it finds that a host is busy, it will wait
>> >  fetcher.server.delay.  After http.max.delays attempts, it will give
>> >  up on the page for now.</description>
>> > </property>
>> >
>> > Only one node executes the fetch phase anyway :_(
>> >
>> > Thanks for the hint anyway... more ideas ?
>> >
>> > On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hi
>> >>
>> >> 1. You should have set the
>> >> mapred.map.tasks
>> >> and
>> >> mapred.reduce.tasks parameters. They are set to 2 and 1 by default.
>> >>
>> >> 2. You can specify the number of threads to perform fetching. Also there
>> >> is a parameter that slows down fetching from one host, so-called polite
>> >> fetching, so as not to DoS the site.
>> >>
>> >> So check your configuration.
>> >>
>> >> Alex
>> >>
>> >> 2008/8/5 brainstorm <[EMAIL PROTECTED]>
>> >>
>> >>> Ok, DFS warnings problem solved; it seems that the hadoop-0.17.1 patch
>> >>> fixes the warnings... BUT, on a 7-node nutch cluster:
>> >>>
>> >>> 1) Fetching is only happening on *one* node despite several values
>> >>> tested for the following settings:
>> >>> mapred.tasktracker.map.tasks.maximum
>> >>> mapred.tasktracker.reduce.tasks.maximum
>> >>> export HADOOP_HEAPSIZE
>> >>>
>> >>> I've played with mapreduce (hadoop-site.xml) settings as advised on:
>> >>>
>> >>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
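>> >>>
>> >>> In case it helps, this is the kind of hadoop-site.xml block I've been
>> >>> tweaking for the per-node limits listed above (only a sketch; the values
>> >>> are just one of the combinations I tried, not known-good ones):
>> >>>
>> >>> <property>
>> >>>  <name>mapred.tasktracker.map.tasks.maximum</name>
>> >>>  <value>2</value>
>> >>>  <description>Maximum number of map tasks run simultaneously by a
>> >>>  single tasktracker.</description>
>> >>> </property>
>> >>>
>> >>> <property>
>> >>>  <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> >>>  <value>2</value>
>> >>>  <description>Maximum number of reduce tasks run simultaneously by a
>> >>>  single tasktracker.</description>
>> >>> </property>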
>> >>>
>> >>> But nutch keeps crawling using only one node, instead of seven
>> >>> nodes... does anybody know why?
>> >>>
>> >>> I've had a look at the code, searching for:
>> >>>
>> >>> conf.setNumMapTasks(int num), but found none: so I guess that the
>> >>> number of mappers & reducers is not limited programmatically.
>> >>>
>> >>> 2) Even on a single node, the fetching is really slow: 1 url or page
>> >>> per second, at most.
>> >>>
>> >>> Can anybody shed some light on this? Pointers to which class/code I
>> >>> should look into to modify this behaviour would also help.
>> >>>
>> >>> Does anybody have a distributed nutch crawling cluster working with all
>> >>> nodes fetching during the fetch phase?
>> >>>
>> >>> I even did some numbers with the wordcount example on 7 nodes at 100%
>> >>> cpu usage, using a 425MB parsedtext file:
>> >>>
>> >>> maps    reduces heapsize (MB)   time
>> >>> 2       2       500     3m43.049s
>> >>> 4       4       500     4m41.846s
>> >>> 8       8       500     4m29.344s
>> >>> 16      16      500     3m43.672s
>> >>> 32      32      500     3m41.367s
>> >>> 64      64      500     4m27.275s
>> >>> 128     128     500     4m35.233s
>> >>> 256     256     500     3m41.916s
>> >>>
>> >>>
>> >>> 2       2       2000    4m31.434s
>> >>> 4       4       2000
>> >>> 8       8       2000
>> >>> 16      16      2000    4m32.213s
>> >>> 32      32      2000
>> >>> 64      64      2000
>> >>> 128     128     2000
>> >>> 256     256     2000    4m38.310s
>> >>>
>> >>> Thanks in advance,
>> >>> Roman
>> >>>
>> >>> > On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> > While watching the DFS wireshark trace (and the corresponding RSTs),
>> >>> > the crawl continued to the next step... it seems that this WARNING is
>> >>> > actually slowing down the whole crawling process (it took 36 minutes
>> >>> > to complete the previous fetch) with just a 3-url seed file :-!!!
>> >>> >
>> >>> > I just posted a couple of exceptions/questions regarding DFS on the
>> >>> > hadoop core mailing list.
>> >>> >
>> >>> > PS: As a side note, the following error caught my attention:
>> >>> >
>> >>> > Fetcher: starting
>> >>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
>> >>> > Too many fetch-failures
>> >>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
>> >>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
>> >>> > with: org.apache.nutch.protocol.http.api.HttpException:
>> >>> > java.net.UnknownHostException: upc.cat
>> >>> >
>> >>> > Unknown host ?¿ Just try "http://upc.cat" in your browser, it *does*
>> >>> > exist, it just gets redirected to www.upc.cat :-/
>> >>> >
>> >>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> >> Yep, I know about wireshark, and wanted to avoid using it to debug
>> >>> >> this issue (hoping there was a simple solution / known bug)...
>> >>> >>
>> >>> >> I just launched wireshark on the frontend with filter tcp.port == 50010,
>> >>> >> and now I'm diving into the tcp stream... let's see if I see the light
>> >>> >> (an RST flag somewhere?), thanks anyway for replying ;)
>> >>> >>
>> >>> >> Just for the record, the phase that stalls is fetcher during reduce:
>> >>> >>
>> >>> >> Jobid: job_200807151723_0005   User: hadoop
>> >>> >> Name: fetch crawl-ecxi/segments/20080715172458
>> >>> >> Maps completed: 2 of 2 (100.00%)   Reduces completed: 0 of 1 (16.66%)
>> >>> >>
>> >>> >> It's stuck on 16%, no traffic, no crawling, but still "running".
>> >>> >>
>> >>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> >>> >> <[EMAIL PROTECTED]> wrote:
>> >>> >>> Hi brain,
>> >>> >>>        If I were you, I would download wireshark
>> >>> >>> (http://www.wireshark.org/download.html) to see what is happening at
>> >>> >>> the network layer and see if that provides any clues.  A socket
>> >>> >>> exception that you don't expect is usually due to one side of the
>> >>> >>> conversation not understanding the other side.  If you have 4
>> >>> >>> machines, then you have 4 possible places where default firewall
>> >>> >>> rules could be causing an issue.  If it is not the firewall rules,
>> >>> >>> the NAT rules could be a potential source of error.  Also, even a
>> >>> >>> router hardware error could cause a problem.
>> >>> >>>        If you understand TCP, just make sure that you see all the
>> >>> >>> correct TCP stuff happening in wireshark.  If you don't understand
>> >>> >>> wireshark's display, let me know, and I'll pass on some quickstart
>> >>> >>> information.
>> >>> >>>
>> >>> >>>        If you already know all of this, I don't have any way to help
>> >>> >>> you, as it looks like you're trying to accomplish something trickier
>> >>> >>> with nutch than I have ever attempted.
>> >>> >>>
>> >>> >>> Patrick
>> >>> >>>
>> >>> >>> -----Original Message-----
>> >>> >>> From: brainstorm [mailto:[EMAIL PROTECTED]
>> >>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
>> >>> >>> To: [email protected]
>> >>> >>> Subject: Re: Distributed fetching only happening in one node ?
>> >>> >>>
>> >>> >>> Boiling down the problem, I'm stuck on this:
>> >>> >>>
>> >>> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>> >>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> >>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>> >>> >>>        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>> >>> >>>        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >>> >>>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> >>> >>>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>> >>> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >>> >>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>> >>> >>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>> >>> >>>        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>> >>> >>>        at java.lang.Thread.run(Thread.java:595)
>> >>> >>>
>> >>> >>> Checked that firewall settings between node & frontend were not
>> >>> >>> blocking packets, and they aren't... does anyone know why this
>> >>> >>> happens? If not, could you provide a convenient way to debug it?
>> >>> >>>
>> >>> >>> Thanks !
>> >>> >>>
>> >>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> >>>> Hi,
>> >>> >>>>
>> >>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine rocks
>> >>> >>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
>> >>> >>>> best-suited network topology for inet crawling (the frontend being a
>> >>> >>>> net bottleneck), but I think it's fine for testing purposes.
>> >>> >>>>
>> >>> >>>> I'm having issues with the fetch mapreduce job:
>> >>> >>>>
>> >>> >>>> According to ganglia monitoring (network traffic) and the hadoop
>> >>> >>>> administrative interfaces, the fetch phase is only being executed on
>> >>> >>>> the frontend node, where I launched "nutch crawl". Previous nutch
>> >>> >>>> phases were executed neatly distributed across all nodes:
>> >>> >>>>
>> >>> >>>> job_200807131223_0001  hadoop  inject urls
>> >>> >>>>   maps: 2 of 2 (100.00%)   reduces: 1 of 1 (100.00%)
>> >>> >>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb
>> >>> >>>>   maps: 3 of 3 (100.00%)   reduces: 1 of 1 (100.00%)
>> >>> >>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547
>> >>> >>>>   maps: 3 of 3 (100.00%)   reduces: 1 of 1 (100.00%)
>> >>> >>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547
>> >>> >>>>   maps: 4 of 4 (100.00%)   reduces: 2 of 2 (100.00%)
>> >>> >>>>
>> >>> >>>> I've checked that:
>> >>> >>>>
>> >>> >>>> 1) Nodes have inet connectivity and firewall settings are not blocking traffic
>> >>> >>>> 2) There's enough space on local discs
>> >>> >>>> 3) Proper processes are running on nodes
>> >>> >>>>
>> >>> >>>> frontend-node:
>> >>> >>>> ==========
>> >>> >>>>
>> >>> >>>> [EMAIL PROTECTED] ~]# jps
>> >>> >>>> 29232 NameNode
>> >>> >>>> 29489 DataNode
>> >>> >>>> 29860 JobTracker
>> >>> >>>> 29778 SecondaryNameNode
>> >>> >>>> 31122 Crawl
>> >>> >>>> 30137 TaskTracker
>> >>> >>>> 10989 Jps
>> >>> >>>> 1818 TaskTracker$Child
>> >>> >>>>
>> >>> >>>> leaf nodes:
>> >>> >>>> ========
>> >>> >>>>
>> >>> >>>> [EMAIL PROTECTED] ~]# cluster-fork jps
>> >>> >>>> compute-0-1:
>> >>> >>>> 23929 Jps
>> >>> >>>> 15568 TaskTracker
>> >>> >>>> 15361 DataNode
>> >>> >>>> compute-0-2:
>> >>> >>>> 32272 TaskTracker
>> >>> >>>> 32065 DataNode
>> >>> >>>> 7197 Jps
>> >>> >>>> 2397 TaskTracker$Child
>> >>> >>>> compute-0-3:
>> >>> >>>> 12054 DataNode
>> >>> >>>> 19584 Jps
>> >>> >>>> 14824 TaskTracker$Child
>> >>> >>>> 12261 TaskTracker
>> >>> >>>>
>> >>> >>>> 4) Logs only show the fetching process (taking place only in the
>> >>> >>>> head node):
>> >>> >>>>
>> >>> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>> >>> >>>> http://valleycycles.net/
>> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>>
>> >>> >>>> What am I missing? Why are there no fetching instances on the
>> >>> >>>> nodes? I used the following custom script to launch a pristine
>> >>> >>>> crawl each time:
>> >>> >>>>
>> >>> >>>> #!/bin/sh
>> >>> >>>>
>> >>> >>>> # 1) Stops hadoop daemons
>> >>> >>>> # 2) Overwrites new url list on HDFS
>> >>> >>>> # 3) Starts hadoop daemons
>> >>> >>>> # 4) Performs a clean crawl
>> >>> >>>>
>> >>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>> >>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>> >>> >>>>
>> >>> >>>> CRAWL_DIR=${1:-crawl-ecxi}
>> >>> >>>> URL_DIR=${2:-urls}
>> >>> >>>>
>> >>> >>>> echo $CRAWL_DIR
>> >>> >>>> echo $URL_DIR
>> >>> >>>>
>> >>> >>>> echo "Leaving safe mode..."
>> >>> >>>> ./hadoop dfsadmin -safemode leave
>> >>> >>>>
>> >>> >>>> echo "Removing seed urls directory and previous crawled
>> content..."
>> >>> >>>> ./hadoop dfs -rmr $URL_DIR
>> >>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
>> >>> >>>>
>> >>> >>>> echo "Removing past logs"
>> >>> >>>>
>> >>> >>>> rm -rf ../logs/*
>> >>> >>>>
>> >>> >>>> echo "Uploading seed urls..."
>> >>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>> >>> >>>>
>> >>> >>>> #echo "Entering safe mode..."
>> >>> >>>> #./hadoop dfsadmin -safemode enter
>> >>> >>>>
>> >>> >>>> echo "******************"
>> >>> >>>> echo "* STARTING CRAWL *"
>> >>> >>>> echo "******************"
>> >>> >>>>
>> >>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> The next step I'm thinking of to fix the problem is to install
>> >>> >>>> nutch+hadoop as specified in this past nutch-user mail:
>> >>> >>>>
>> >>> >>>> http://www.mail-archive.com/[email protected]/msg10225.html
>> >>> >>>>
>> >>> >>>> As I don't know whether that's current practice on trunk (the
>> >>> >>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
>> >>> >>>> another way to fix it, or if it's being worked on by someone... I
>> >>> >>>> haven't found a matching bug in JIRA :_/
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards
>> >> Alexander Aristov
>> >>
>> >
>>
>
>
>
> --
> Best Regards
> Alexander Aristov
>
