Hi,

1. You should set the mapred.map.tasks and mapred.reduce.tasks parameters; they default to 2 and 1.

2. You can also set the number of threads the fetcher uses. In addition, there is a parameter that slows down fetching from any one host (so-called polite fetching, so you don't DoS the site). So check your configuration; a sketch of the relevant properties follows below.
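A minimal sketch of those properties, assuming a Hadoop 0.17-era hadoop-site.xml and the usual Nutch fetcher properties in nutch-site.xml; the values shown are illustrative, not recommendations:

  <!-- hadoop-site.xml, inside <configuration>: tasks requested per job -->
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>

  <!-- nutch-site.xml, inside <configuration>: fetcher parallelism and politeness -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <!-- seconds to wait between requests to the same server -->
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>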
Alex

2008/8/5 brainstorm <[EMAIL PROTECTED]>

> Ok, DFS warnings problem solved; it seems that the hadoop-0.17.1 patch fixes
> the warnings... BUT, on a 7-node nutch cluster:
>
> 1) Fetching is only happening on *one* node despite several values
> tested for these settings:
> mapred.tasktracker.map.tasks.maximum
> mapred.tasktracker.reduce.tasks.maximum
> export HADOOP_HEAPSIZE
>
> I've played with the mapreduce (hadoop-site.xml) settings as advised on:
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> But nutch keeps crawling using only one node instead of seven
> nodes... does anybody know why?
>
> I've had a look at the code, searching for
> conf.setNumMapTasks(int num), but found none, so I guess that the
> number of mappers & reducers is not limited programmatically.
>
> 2) Even on a single node, the fetching is really slow: 1 url or page
> per second, at most.
>
> Can anybody shed some light on this? Pointing out which class/code I
> should look into to modify this behaviour would also help.
>
> Does anybody have a distributed nutch crawling cluster working with all
> nodes fetching during the fetch phase?
>
> I even did some numbers with the wordcount example, using 7 nodes at 100%
> cpu usage on a 425MB parsedtext file:
>
> maps  reduces  heapsize  time
>    2        2       500  3m43.049s
>    4        4       500  4m41.846s
>    8        8       500  4m29.344s
>   16       16       500  3m43.672s
>   32       32       500  3m41.367s
>   64       64       500  4m27.275s
>  128      128       500  4m35.233s
>  256      256       500  3m41.916s
>
>    2        2      2000  4m31.434s
>    4        4      2000
>    8        8      2000
>   16       16      2000  4m32.213s
>   32       32      2000
>   64       64      2000
>  128      128      2000
>  256      256      2000  4m38.310s
>
> Thanks in advance,
> Roman
>
> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> > While watching the DFS wireshark trace (and the corresponding RSTs), the
> > crawl continued to the next step... it seems that this WARNING is actually
> > slowing down the whole crawling process (it took 36 minutes to
> > complete the previous fetch) with just a 3-url seed file :-!!!
> >
> > I just posted a couple of exceptions/questions regarding DFS on the hadoop
> > core mailing list.
> >
> > PS: As a side note, the following error caught my attention:
> >
> > Fetcher: starting
> > Fetcher: segment: crawl-ecxi/segments/20080715172458
> > Too many fetch-failures
> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> > with: org.apache.nutch.protocol.http.api.HttpException:
> > java.net.UnknownHostException: upc.cat
> >
> > Unknown host?¿ Just try "http://upc.cat" in your browser: it *does*
> > exist, it just gets redirected to www.upc.cat :-/
> >
> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> >> Yep, I know about wireshark, and wanted to avoid it while debugging this
> >> issue (in case there was a simple solution/known bug/issue)...
> >>
> >> I just launched wireshark on the frontend with the filter tcp.port == 50010,
> >> and now I'm diving into the tcp stream...
> >> let's see if I see the light (an RST flag somewhere?); thanks anyway for replying ;)
> >>
> >> Just for the record, the phase that stalls is the fetcher, during reduce:
> >>
> >> Jobid: job_200807151723_0005   User: hadoop   Name: fetch crawl-ecxi/segments/20080715172458
> >> Map % Complete: 100.00%   Map Total: 2   Maps Completed: 2
> >> Reduce % Complete: 16.66%   Reduce Total: 1   Reduces Completed: 0
> >>
> >> It's stuck at 16%, no traffic, no crawling, but still "running".
> >>
> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> >> <[EMAIL PROTECTED]> wrote:
> >>> Hi brain,
> >>>         If I were you, I would download wireshark
> >>> (http://www.wireshark.org/download.html) to see what is happening at the
> >>> network layer and see if that provides any clues. A socket exception
> >>> that you don't expect is usually due to one side of the conversation not
> >>> understanding the other side. If you have 4 machines, then you have 4
> >>> possible places where default firewall rules could be causing an issue.
> >>> If it is not the firewall rules, the NAT rules could be a potential
> >>> source of error. Also, even a router hardware error could cause a
> >>> problem.
> >>>         If you understand TCP, just make sure that you see all the
> >>> correct TCP stuff happening in wireshark. If you don't understand
> >>> wireshark's display, let me know, and I'll pass on some quickstart
> >>> information.
> >>>
> >>>         If you already know all of this, I don't have any way to help
> >>> you, as it looks like you're trying to accomplish something trickier
> >>> with nutch than I have ever attempted.
> >>>
> >>> Patrick
> >>>
> >>> -----Original Message-----
> >>> From: brainstorm [mailto:[EMAIL PROTECTED]
> >>> Sent: Tuesday, July 15, 2008 10:08 AM
> >>> To: [email protected]
> >>> Subject: Re: Distributed fetching only happening in one node ?
> >>>
> >>> Boiling down the problem, I'm stuck on this:
> >>>
> >>> 2008-07-14 16:43:24,976 WARN dfs.DataNode -
> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
> >>>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
> >>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >>>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> >>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
> >>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
> >>>         at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
> >>>         at java.lang.Thread.run(Thread.java:595)
> >>>
> >>> I checked that the firewall settings between the nodes and the frontend
> >>> were not blocking packets, and they aren't... does anyone know why this
> >>> happens? If not, could you suggest a convenient way to debug it?
> >>>
> >>> Thanks !
> >>>
> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> >>>> Hi,
> >>>>
> >>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine rocks
> >>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
> >>>> best suited network topology for inet crawling (the frontend being a
> >>>> net bottleneck), but I think it's fine for testing purposes.
> >>>>
> >>>> I'm having issues with the fetch mapreduce job:
> >>>>
> >>>> According to ganglia monitoring (network traffic) and the hadoop
> >>>> administrative interfaces, the fetch phase is only being executed on the
> >>>> frontend node, where I launched "nutch crawl". The previous nutch phases
> >>>> were executed neatly distributed across all nodes:
> >>>>
> >>>> (job, user, name, map %, maps total, maps completed, reduce %, reduces total, reduces completed)
> >>>> job_200807131223_0001  hadoop  inject urls                                              100.00%  2  2  100.00%  1  1
> >>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                               100.00%  3  3  100.00%  1  1
> >>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%  3  3  100.00%  1  1
> >>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%  4  4  100.00%  2  2
> >>>>
> >>>> I've checked that:
> >>>>
> >>>> 1) Nodes have inet connectivity and their firewall settings are not blocking traffic
> >>>> 2) There's enough space on the local discs
> >>>> 3) The proper processes are running on the nodes
> >>>>
> >>>> frontend-node:
> >>>> ==========
> >>>>
> >>>> [EMAIL PROTECTED] ~]# jps
> >>>> 29232 NameNode
> >>>> 29489 DataNode
> >>>> 29860 JobTracker
> >>>> 29778 SecondaryNameNode
> >>>> 31122 Crawl
> >>>> 30137 TaskTracker
> >>>> 10989 Jps
> >>>> 1818 TaskTracker$Child
> >>>>
> >>>> leaf nodes:
> >>>> ========
> >>>>
> >>>> [EMAIL PROTECTED] ~]# cluster-fork jps
> >>>> compute-0-1:
> >>>> 23929 Jps
> >>>> 15568 TaskTracker
> >>>> 15361 DataNode
> >>>> compute-0-2:
> >>>> 32272 TaskTracker
> >>>> 32065 DataNode
> >>>> 7197 Jps
> >>>> 2397 TaskTracker$Child
> >>>> compute-0-3:
> >>>> 12054 DataNode
> >>>> 19584 Jps
> >>>> 14824 TaskTracker$Child
> >>>> 12261 TaskTracker
> >>>>
> >>>> 4) The logs only show the fetching process (taking place only on the head node):
> >>>>
> >>>> 2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching
> >>>> http://valleycycles.net/
> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>>
> >>>> What am I missing? Why are there no fetching instances on the nodes? I
> >>>> used the following custom script to launch a pristine crawl each time:
> >>>>
> >>>> #!/bin/sh
> >>>>
> >>>> # 1) Stops hadoop daemons
> >>>> # 2) Overwrites new url list on HDFS
> >>>> # 3) Starts hadoop daemons
> >>>> # 4) Performs a clean crawl
> >>>>
> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
> >>>>
> >>>> CRAWL_DIR=crawl-ecxi || $1
> >>>> URL_DIR=urls || $2
> >>>>
> >>>> echo $CRAWL_DIR
> >>>> echo $URL_DIR
> >>>>
> >>>> echo "Leaving safe mode..."
> >>>> ./hadoop dfsadmin -safemode leave
> >>>>
> >>>> echo "Removing seed urls directory and previous crawled content..."
> >>>> ./hadoop dfs -rmr $URL_DIR
> >>>> ./hadoop dfs -rmr $CRAWL_DIR
> >>>>
> >>>> echo "Removing past logs"
> >>>>
> >>>> rm -rf ../logs/*
> >>>>
> >>>> echo "Uploading seed urls..."
> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
> >>>>
> >>>> #echo "Entering safe mode..."
> >>>> #./hadoop dfsadmin -safemode enter
> >>>>
> >>>> echo "******************"
> >>>> echo "* STARTING CRAWL *"
> >>>> echo "******************"
> >>>>
> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
> >>>>
> >>>>
> >>>> The next step I'm considering, to fix the problem, is to install
> >>>> nutch+hadoop as specified in this past nutch-user mail:
> >>>>
> >>>> http://www.mail-archive.com/[email protected]/msg10225.html
> >>>>
> >>>> As I don't know whether that's still current practice on trunk (the
> >>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
> >>>> another way to fix it or if it's being worked on by someone... I
> >>>> haven't found a matching bug in JIRA :_/

--
Best Regards
Alexander Aristov
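A note on the crawl script quoted above: lines like "CRAWL_DIR=crawl-ecxi || $1" never fall back to the command-line argument, because the plain assignment always succeeds and the command after "||" is never run. If the intent was optional arguments with those defaults, a minimal sketch (keeping the script's own names and default values) would be:

  #!/bin/sh
  # Use the first/second argument when given, otherwise fall back to the defaults.
  CRAWL_DIR=${1:-crawl-ecxi}
  URL_DIR=${2:-urls}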
