While watching the DFS wireshark trace (and the corresponding RSTs), the crawl continued to the next step... it seems that this WARNING is actually slowing down the whole crawling process (it took 36 minutes to complete the previous fetch) with just a 3-URL seed file :-!!!
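For what it's worth, a quick way to see how often that warning actually fires (and whether its timestamps line up with the 36-minute fetch) is to grep the datanode logs. A rough sketch; the path assumes the default logs/ directory next to bin/, the same one the crawl script further down wipes, and the hadoop-*-datanode-*.log naming is just the Hadoop default:

# Count occurrences of the transfer warning per second in the datanode logs.
# -h suppresses file names so the timestamp stays at the start of each line.
grep -h "Failed to transfer" ../logs/hadoop-*-datanode-*.log | cut -c1-19 | sort | uniq -c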
I just posted a couple of exceptions/questions regarding DFS on the hadoop core mailing list.

PS: As a side note, the following error caught my attention:

Fetcher: starting
Fetcher: segment: crawl-ecxi/segments/20080715172458
Too many fetch-failures
task_200807151723_0005_m_000000_0: Fetcher: threads: 10
task_200807151723_0005_m_000000_0: fetching http://upc.es/
task_200807151723_0005_m_000000_0: fetching http://upc.edu/
task_200807151723_0005_m_000000_0: fetching http://upc.cat/
task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed with: org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: upc.cat

Unknown host? Just try "http://upc.cat" in your browser: it *does* exist, it just gets redirected to www.upc.cat :-/

On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Yep, I know about wireshark, and wanted to avoid it to debug this
> issue (perhaps there was a simple solution/known bug/issue)...
>
> I just launched wireshark on the frontend with the filter tcp.port == 50010,
> and now I'm diving into the tcp stream... let's see if I see the light
> (RST flag somewhere?), thanks anyway for replying ;)
>
> Just for the record, the phase that stalls is the fetcher, during reduce:
>
> (columns: Jobid, User, Name, Map % Complete, Map Total, Maps Completed, Reduce % Complete, Reduce Total, Reduces Completed)
> job_200807151723_0005  hadoop  fetch crawl-ecxi/segments/20080715172458  100.00%  2  2  16.66%  1  0
>
> It's stuck at 16%: no traffic, no crawling, but still "running".
>
> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> <[EMAIL PROTECTED]> wrote:
>> Hi brain,
>>         If I were you, I would download wireshark
>> (http://www.wireshark.org/download.html) to see what is happening at the
>> network layer and see if that provides any clues. A socket exception
>> that you don't expect is usually due to one side of the conversation not
>> understanding the other side. If you have 4 machines, then you have 4
>> possible places where default firewall rules could be causing an issue.
>> If it is not the firewall rules, the NAT rules could be a potential
>> source of error. Also, even a router hardware error could cause a
>> problem.
>>         If you understand TCP, just make sure that you see all the
>> correct TCP stuff happening in wireshark. If you don't understand
>> wireshark's display, let me know, and I'll pass on some quickstart
>> information.
>>
>>         If you already know all of this, I don't have any way to help
>> you, as it looks like you're trying to accomplish something trickier
>> with nutch than I have ever attempted.
>>
>> Patrick
>>
>> -----Original Message-----
>> From: brainstorm [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, July 15, 2008 10:08 AM
>> To: [email protected]
>> Subject: Re: Distributed fetching only happening in one node ?
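(As an aside to the wireshark discussion above: the tcp.port == 50010 capture can also be scripted with tshark, wireshark's console front-end, which makes any RSTs easier to spot once the trace below is reproduced. This is only a sketch; the interface name and capture path are assumptions.)

# Capture DataNode traffic on the frontend; eth0 and /tmp are assumptions.
tshark -i eth0 -f "tcp port 50010" -w /tmp/datanode-50010.pcap
# Afterwards, show only the packets carrying a RST flag:
tshark -r /tmp/datanode-50010.pcap -R "tcp.flags.reset == 1"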
>>
>> Boiling down the problem, I'm stuck on this:
>>
>> 2008-07-14 16:43:24,976 WARN dfs.DataNode -
>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>>         at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>>         at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>>         at java.lang.Thread.run(Thread.java:595)
>>
>> I checked that the firewall settings between the node & the frontend were
>> not blocking packets, and they aren't... does anyone know why this happens?
>> If not, could you suggest a convenient way to debug it?
>>
>> Thanks !
>>
>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
>>> best-suited network topology for internet crawling (the frontend being
>>> a network bottleneck), but I think it's fine for testing purposes.
>>>
>>> I'm having issues with the fetch mapreduce job:
>>>
>>> According to ganglia monitoring (network traffic) and the hadoop
>>> administrative interfaces, the fetch phase is only being executed on the
>>> frontend node, where I launched "nutch crawl". The previous nutch phases
>>> were executed neatly distributed across all nodes:
>>>
>>> (columns: Jobid, User, Name, Map % Complete, Map Total, Maps Completed, Reduce % Complete, Reduce Total, Reduces Completed)
>>> job_200807131223_0001  hadoop  inject urls                                             100.00%  2  2  100.00%  1  1
>>> job_200807131223_0002  hadoop  crawldb crawl-ecxi/crawldb                              100.00%  3  3  100.00%  1  1
>>> job_200807131223_0003  hadoop  generate: select crawl-ecxi/segments/20080713123547     100.00%  3  3  100.00%  1  1
>>> job_200807131223_0004  hadoop  generate: partition crawl-ecxi/segments/20080713123547  100.00%  4  4  100.00%  2  2
>>>
>>> I've checked that:
>>>
>>> 1) The nodes have internet connectivity and sane firewall settings
>>> 2) There's enough space on the local discs
>>> 3) The proper processes are running on the nodes
>>>
>>> frontend node:
>>> ==========
>>>
>>> [EMAIL PROTECTED] ~]# jps
>>> 29232 NameNode
>>> 29489 DataNode
>>> 29860 JobTracker
>>> 29778 SecondaryNameNode
>>> 31122 Crawl
>>> 30137 TaskTracker
>>> 10989 Jps
>>> 1818 TaskTracker$Child
>>>
>>> leaf nodes:
>>> ========
>>>
>>> [EMAIL PROTECTED] ~]# cluster-fork jps
>>> compute-0-1:
>>> 23929 Jps
>>> 15568 TaskTracker
>>> 15361 DataNode
>>> compute-0-2:
>>> 32272 TaskTracker
>>> 32065 DataNode
>>> 7197 Jps
>>> 2397 TaskTracker$Child
>>> compute-0-3:
>>> 12054 DataNode
>>> 19584 Jps
>>> 14824 TaskTracker$Child
>>> 12261 TaskTracker
>>>
>>> 4) The logs only show the fetching process (taking place only on the head node):
>>>
>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching http://valleycycles.net/
>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get robots.txt for http://www.getting-forward.org/: java.net.UnknownHostException: www.getting-forward.org
>>>
>>> What am I missing?
>>> Why are there no fetching instances on the nodes? I
>>> used the following custom script to launch a pristine crawl each time:
>>>
>>> #!/bin/sh
>>>
>>> # 1) Stops hadoop daemons
>>> # 2) Overwrites the new url list on HDFS
>>> # 3) Starts hadoop daemons
>>> # 4) Performs a clean crawl
>>>
>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>>
>>> # Default to crawl-ecxi and urls unless overridden on the command line
>>> CRAWL_DIR=${1:-crawl-ecxi}
>>> URL_DIR=${2:-urls}
>>>
>>> echo $CRAWL_DIR
>>> echo $URL_DIR
>>>
>>> echo "Leaving safe mode..."
>>> ./hadoop dfsadmin -safemode leave
>>>
>>> echo "Removing seed urls directory and previous crawled content..."
>>> ./hadoop dfs -rmr $URL_DIR
>>> ./hadoop dfs -rmr $CRAWL_DIR
>>>
>>> echo "Removing past logs"
>>>
>>> rm -rf ../logs/*
>>>
>>> echo "Uploading seed urls..."
>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>>
>>> #echo "Entering safe mode..."
>>> #./hadoop dfsadmin -safemode enter
>>>
>>> echo "******************"
>>> echo "* STARTING CRAWL *"
>>> echo "******************"
>>>
>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>>>
>>>
>>> The next step I'm considering to fix the problem is to install
>>> nutch+hadoop as specified in this past nutch-user mail:
>>>
>>> http://www.mail-archive.com/[email protected]/msg10225.html
>>>
>>> As I don't know whether that is still current practice on trunk (the
>>> archived mail is from Wed, 02 Jan 2008), I wanted to ask if there's
>>> another way to fix it or if it's being worked on by someone... I haven't
>>> found a matching bug on JIRA :_/
>>>
>>
>
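Regarding the Connection reset on the block transfer from 192.168.0.100 to 192.168.0.252 earlier in the thread: a plain TCP probe of the DataNode port in both directions can rule firewalling in or out before diving into packet captures. A sketch only; the node names are assumed to follow the Rocks compute-0-N convention from the jps listings above, and the frontend address is taken from the stack trace:

# Probe the DataNode port from the frontend to each leaf node...
for node in compute-0-1 compute-0-2 compute-0-3; do
    nc -z -w 3 "$node" 50010 && echo "$node: 50010 reachable" || echo "$node: 50010 blocked"
done
# ...and from each leaf node back to the frontend.
cluster-fork 'nc -z -w 3 192.168.0.100 50010 && echo "frontend reachable" || echo "frontend blocked"'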
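On the UnknownHostException side (upc.cat in the fetcher output, www.getting-forward.org in the robots.txt log), it is worth confirming that the NATed leaf nodes can resolve those names at all, since the fetch tasks run there rather than on the desktop where the www.upc.cat redirect was seen in a browser. A hedged sketch, reusing the same cluster-fork helper as above:

# Compare name resolution on the frontend and on each leaf node.
for name in upc.cat www.upc.cat www.getting-forward.org; do
    host "$name" > /dev/null && echo "frontend resolves $name" || echo "frontend cannot resolve $name"
done
cluster-fork 'host upc.cat > /dev/null && echo "resolves upc.cat" || echo "cannot resolve upc.cat"'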
