Sure, I tried mapred.map.tasks and mapred.reduce.tasks with values 2 and 1 respectively *in the past*, with the same results. Right now I have 32 for both: still the same results, since those settings are just a hint for nutch.
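
For reference, this is a minimal sketch of how those two properties look in my hadoop-site.xml right now (only the two entries discussed here; the descriptions are my own wording, not the stock ones):

<property>
  <name>mapred.map.tasks</name>
  <value>32</value>
  <description>Default number of map tasks per job.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>32</value>
  <description>Default number of reduce tasks per job.</description>
</property>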
Regarding the number of threads *per host*, I tried 10 and 20 in the past, with the same results (the exact properties I mean are sketched in the P.S. at the bottom of this mail). I appreciate your support Alexander, thank you :)

On Tue, Aug 5, 2008 at 9:17 AM, Alexander Aristov <[EMAIL PROTECTED]> wrote:
> Still not clear.
>
> What values for mapred.map.tasks and mapred.reduce.tasks do you have now?
> Check the hadoop-site.xml file as it may affect your configuration also.
>
> Alexander
>
> 2008/8/5 brainstorm <[EMAIL PROTECTED]>
>
>> Correction: Only 2 nodes doing map operation on fetch (nodes 7 and 2).
>>
>> On Tue, Aug 5, 2008 at 9:11 AM, brainstorm <[EMAIL PROTECTED]> wrote:
>> > Right, I've checked before with mapred.map.tasks set to 2 and
>> > mapred.reduce.tasks set to 1.
>> >
>> > I've also played with several values for the following settings:
>> >
>> > <property>
>> >   <name>fetcher.server.delay</name>
>> >   <value>1.5</value>
>> >   <description>The number of seconds the fetcher will delay between
>> >   successive requests to the same server.</description>
>> > </property>
>> >
>> > <property>
>> >   <name>http.max.delays</name>
>> >   <value>3</value>
>> >   <description>The number of times a thread will delay when trying to
>> >   fetch a page. Each time it finds that a host is busy, it will wait
>> >   fetcher.server.delay. After http.max.delays attempts, it will give
>> >   up on the page for now.</description>
>> > </property>
>> >
>> > Only one node executes the fetch phase anyway :_(
>> >
>> > Thanks for the hint anyway... more ideas?
>> >
>> > On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
>> > <[EMAIL PROTECTED]> wrote:
>> >> Hi
>> >>
>> >> 1. You should have set the mapred.map.tasks and mapred.reduce.tasks
>> >> parameters. They are set to 2 and 1 by default.
>> >>
>> >> 2. You can specify the number of threads to perform fetching. Also
>> >> there is a parameter that slows down fetching from one URL, so-called
>> >> polite fetching, to not DOS the site.
>> >>
>> >> So check your configuration.
>> >>
>> >> Alex
>> >>
>> >> 2008/8/5 brainstorm <[EMAIL PROTECTED]>
>> >>
>> >>> Ok, DFS warnings problem solved, it seems that the hadoop-0.17.1 patch
>> >>> fixes the warnings... BUT, on a 7-node nutch cluster:
>> >>>
>> >>> 1) Fetching is only happening on *one* node despite several values
>> >>> tested for the settings:
>> >>>   mapred.tasktracker.map.tasks.maximum
>> >>>   mapred.tasktracker.reduce.tasks.maximum
>> >>>   export HADOOP_HEAPSIZE
>> >>>
>> >>> I've played with the mapreduce (hadoop-site.xml) settings as advised on:
>> >>>
>> >>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>> >>>
>> >>> But nutch keeps crawling using only one node, instead of seven
>> >>> nodes... does anybody know why?
>> >>>
>> >>> I've had a look at the code, searching for conf.setNumMapTasks(int num),
>> >>> but found none: so I guess that the number of mappers & reducers is not
>> >>> limited programmatically.
>> >>>
>> >>> 2) Even on a single node, the fetching is really slow: 1 url or page
>> >>> per second, at most.
>> >>>
>> >>> Can anybody shed some light on this? Pointing out which class/code I
>> >>> should look into to modify this behaviour would help also.
>> >>>
>> >>> Does anybody have a distributed nutch crawling cluster working with all
>> >>> nodes fetching at the fetch phase?
>> >>>
>> >>> I even took some measurements with the wordcount example, using 7 nodes
>> >>> at 100% cpu usage and a 425MB parsedtext file:
>> >>>
>> >>>  maps  reduces  heapsize  time
>> >>>     2        2       500  3m43.049s
>> >>>     4        4       500  4m41.846s
>> >>>     8        8       500  4m29.344s
>> >>>    16       16       500  3m43.672s
>> >>>    32       32       500  3m41.367s
>> >>>    64       64       500  4m27.275s
>> >>>   128      128       500  4m35.233s
>> >>>   256      256       500  3m41.916s
>> >>>
>> >>>     2        2      2000  4m31.434s
>> >>>     4        4      2000
>> >>>     8        8      2000
>> >>>    16       16      2000  4m32.213s
>> >>>    32       32      2000
>> >>>    64       64      2000
>> >>>   128      128      2000
>> >>>   256      256      2000  4m38.310s
>> >>>
>> >>> Thanks in advance,
>> >>> Roman
>> >>>
>> >>> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> > While I was looking at the DFS wireshark trace (and the corresponding
>> >>> > RSTs), the crawl continued to the next step... it seems that this
>> >>> > WARNING is actually slowing down the whole crawling process (it took
>> >>> > 36 minutes to complete the previous fetch) with just a 3-url seed
>> >>> > file :-!!!
>> >>> >
>> >>> > I just posted a couple of exceptions/questions regarding DFS on the
>> >>> > hadoop core mailing list.
>> >>> >
>> >>> > PS: As a side note, the following error caught my attention:
>> >>> >
>> >>> > Fetcher: starting
>> >>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
>> >>> > Too many fetch-failures
>> >>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
>> >>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
>> >>> > with: org.apache.nutch.protocol.http.api.HttpException:
>> >>> > java.net.UnknownHostException: upc.cat
>> >>> >
>> >>> > Unknown host ?¿ Just try "http://upc.cat" in your browser, it *does*
>> >>> > exist, it just gets redirected to www.upc.cat :-/
>> >>> >
>> >>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> >> Yep, I know about wireshark, and I wanted to avoid it to debug this
>> >>> >> issue (perhaps there was a simple solution/known bug/issue)...
>> >>> >>
>> >>> >> I just launched wireshark on the frontend with the filter
>> >>> >> tcp.port == 50010, and now I'm diving into the tcp stream... let's
>> >>> >> see if I see the light (RST flag somewhere?), thanks anyway for
>> >>> >> replying ;)
>> >>> >>
>> >>> >> Just for the record, the phase that stalls is the fetcher during
>> >>> >> reduce:
>> >>> >>
>> >>> >> Jobid: job_200807151723_0005   User: hadoop   Name: fetch crawl-ecxi/segments/20080715172458
>> >>> >> Map % Complete: 100.00%   Map Total: 2   Maps Completed: 2
>> >>> >> Reduce % Complete: 16.66%  Reduce Total: 1  Reduces Completed: 0
>> >>> >>
>> >>> >> It's stuck at 16%, no traffic, no crawling, but still "running".
>> >>> >>
>> >>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> >>> >> <[EMAIL PROTECTED]> wrote:
>> >>> >>> Hi brain,
>> >>> >>> If I were you, I would download wireshark
>> >>> >>> (http://www.wireshark.org/download.html) to see what is happening
>> >>> >>> at the network layer and see if that provides any clues. A socket
>> >>> >>> exception that you don't expect is usually due to one side of the
>> >>> >>> conversation not understanding the other side. If you have 4
>> >>> >>> machines, then you have 4 possible places where default firewall
>> >>> >>> rules could be causing an issue.
>> >>> >>> If it is not the firewall rules, the NAT rules could be a potential
>> >>> >>> source of error. Also, even a router hardware error could cause a
>> >>> >>> problem.
>> >>> >>> If you understand TCP, just make sure that you see all the correct
>> >>> >>> TCP stuff happening in wireshark. If you don't understand
>> >>> >>> wireshark's display, let me know, and I'll pass on some quickstart
>> >>> >>> information.
>> >>> >>>
>> >>> >>> If you already know all of this, I don't have any way to help you,
>> >>> >>> as it looks like you're trying to accomplish something trickier
>> >>> >>> with nutch than I have ever attempted.
>> >>> >>>
>> >>> >>> Patrick
>> >>> >>>
>> >>> >>> -----Original Message-----
>> >>> >>> From: brainstorm [mailto:[EMAIL PROTECTED]
>> >>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
>> >>> >>> To: [email protected]
>> >>> >>> Subject: Re: Distributed fetching only happening in one node ?
>> >>> >>>
>> >>> >>> Boiling down the problem, I'm stuck on this:
>> >>> >>>
>> >>> >>> 2008-07-14 16:43:24,976 WARN dfs.DataNode -
>> >>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> >>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>> >>> >>>   at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>> >>> >>>   at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >>> >>>   at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> >>> >>>   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>> >>> >>>   at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >>> >>>   at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>> >>> >>>   at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>> >>> >>>   at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>> >>> >>>   at java.lang.Thread.run(Thread.java:595)
>> >>> >>>
>> >>> >>> I checked that the firewall settings between node & frontend were
>> >>> >>> not blocking packets, and they aren't... does anyone know why this
>> >>> >>> happens? If not, could you suggest a convenient way to debug it?
>> >>> >>>
>> >>> >>> Thanks !
>> >>> >>>
>> >>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> >>> >>>> Hi,
>> >>> >>>>
>> >>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
>> >>> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
>> >>> >>>> best suited network topology for inet crawling (the frontend being
>> >>> >>>> a net bottleneck), but I think it's fine for testing purposes.
>> >>> >>>>
>> >>> >>>> I'm having issues with the fetch mapreduce job:
>> >>> >>>>
>> >>> >>>> According to ganglia monitoring (network traffic), and the hadoop
>> >>> >>>> administrative interfaces, the fetch phase is only being executed
>> >>> >>>> in the frontend node, where I launched "nutch crawl".
>> >>> >>>> Previous nutch phases were executed neatly distributed on all
>> >>> >>>> nodes:
>> >>> >>>>
>> >>> >>>> job_200807131223_0001  hadoop  inject urls                                               map 100.00% (2/2)  reduce 100.00% (1/1)
>> >>> >>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                                map 100.00% (3/3)  reduce 100.00% (1/1)
>> >>> >>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547      map 100.00% (3/3)  reduce 100.00% (1/1)
>> >>> >>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547   map 100.00% (4/4)  reduce 100.00% (2/2)
>> >>> >>>>
>> >>> >>>> I've checked that:
>> >>> >>>>
>> >>> >>>> 1) Nodes have inet connectivity and correct firewall settings
>> >>> >>>> 2) There's enough space on local discs
>> >>> >>>> 3) Proper processes are running on the nodes
>> >>> >>>>
>> >>> >>>> frontend-node:
>> >>> >>>> ==========
>> >>> >>>>
>> >>> >>>> [EMAIL PROTECTED] ~]# jps
>> >>> >>>> 29232 NameNode
>> >>> >>>> 29489 DataNode
>> >>> >>>> 29860 JobTracker
>> >>> >>>> 29778 SecondaryNameNode
>> >>> >>>> 31122 Crawl
>> >>> >>>> 30137 TaskTracker
>> >>> >>>> 10989 Jps
>> >>> >>>> 1818 TaskTracker$Child
>> >>> >>>>
>> >>> >>>> leaf nodes:
>> >>> >>>> ========
>> >>> >>>>
>> >>> >>>> [EMAIL PROTECTED] ~]# cluster-fork jps
>> >>> >>>> compute-0-1:
>> >>> >>>> 23929 Jps
>> >>> >>>> 15568 TaskTracker
>> >>> >>>> 15361 DataNode
>> >>> >>>> compute-0-2:
>> >>> >>>> 32272 TaskTracker
>> >>> >>>> 32065 DataNode
>> >>> >>>> 7197 Jps
>> >>> >>>> 2397 TaskTracker$Child
>> >>> >>>> compute-0-3:
>> >>> >>>> 12054 DataNode
>> >>> >>>> 19584 Jps
>> >>> >>>> 14824 TaskTracker$Child
>> >>> >>>> 12261 TaskTracker
>> >>> >>>>
>> >>> >>>> 4) Logs only show the fetching process (taking place only in the
>> >>> >>>> head node):
>> >>> >>>>
>> >>> >>>> 2008-07-13 13:33:22,306 INFO fetcher.Fetcher - fetching
>> >>> >>>> http://valleycycles.net/
>> >>> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>> 2008-07-13 13:33:22,349 INFO api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>>
>> >>> >>>> What am I missing? Why are there no fetching instances on the
>> >>> >>>> nodes? I used the following custom script to launch a pristine
>> >>> >>>> crawl each time:
>> >>> >>>>
>> >>> >>>> #!/bin/sh
>> >>> >>>>
>> >>> >>>> # 1) Stops hadoop daemons
>> >>> >>>> # 2) Overwrites new url list on HDFS
>> >>> >>>> # 3) Starts hadoop daemons
>> >>> >>>> # 4) Performs a clean crawl
>> >>> >>>>
>> >>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>> >>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>> >>> >>>>
>> >>> >>>> # default to crawl-ecxi/urls unless arguments are given
>> >>> >>>> CRAWL_DIR=${1:-crawl-ecxi}
>> >>> >>>> URL_DIR=${2:-urls}
>> >>> >>>>
>> >>> >>>> echo $CRAWL_DIR
>> >>> >>>> echo $URL_DIR
>> >>> >>>>
>> >>> >>>> echo "Leaving safe mode..."
>> >>> >>>> ./hadoop dfsadmin -safemode leave
>> >>> >>>>
>> >>> >>>> echo "Removing seed urls directory and previous crawled content..."
>> >>> >>>> ./hadoop dfs -rmr $URL_DIR
>> >>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
>> >>> >>>>
>> >>> >>>> echo "Removing past logs"
>> >>> >>>>
>> >>> >>>> rm -rf ../logs/*
>> >>> >>>>
>> >>> >>>> echo "Uploading seed urls..."
>> >>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>> >>> >>>>
>> >>> >>>> #echo "Entering safe mode..."
>> >>> >>>> #./hadoop dfsadmin -safemode enter
>> >>> >>>>
>> >>> >>>> echo "******************"
>> >>> >>>> echo "* STARTING CRAWL *"
>> >>> >>>> echo "******************"
>> >>> >>>>
>> >>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> Next step I'm thinking on to fix the problem is to install
>> >>> >>>> nutch+hadoop as specified in this past nutch-user mail:
>> >>> >>>>
>> >>> >>>> http://www.mail-archive.com/[email protected]/msg10225.html
>> >>> >>>>
>> >>> >>>> As I don't know if it's current practice on trunk (archived mail
>> >>> >>>> is from Wed, 02 Jan 2008), I wanted to ask if there's another way
>> >>> >>>> to fix it or if it's being worked on by someone... I haven't found
>> >>> >>>> a matching bug on JIRA :_/
>> >>> >>>>
>> >>
>> >> --
>> >> Best Regards
>> >> Alexander Aristov
>> >>
>
> --
> Best Regards
> Alexander Aristov
>
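
P.S. For anyone reading this in the archives: the "threads per host" setting I mention at the top of this mail is the standard Nutch fetcher property, roughly as sketched below for nutch-site.xml (property names as in nutch-default.xml; 20 is one of the per-host values I tried, the other was 10, and fetcher.threads.fetch matches the "Fetcher: threads: 10" line in the log quoted above; descriptions are my own wording):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>Total number of fetcher threads used by each fetch task.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>20</value>
  <description>Maximum number of threads allowed to fetch from the same host at once.</description>
</property>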
