Hi,

I want to use Nutch on EC2 to crawl around 100 million URLs, extracting only the
questions and answers from http://answers.yahoo.com. I'm a Nutch newbie, so
apologies for any basic queries. I have the following questions:

1. I chose the individual fetch, generate, updatedb, etc. CLI commands over the
single 'crawl' command because the former supports incremental crawling, and I
can crawl the entire http://answers.yahoo.com site. Is my understanding correct?

2. For now, I've set up a crawl on a single large Ubuntu instance with a 10 GB
EBS volume attached, using the Nutch 1.0 release binaries, to get an
understanding of Nutch before I delve into Hadoop to scale out. The Nutch
binaries and the crawl folder are set up on the EBS volume. However, each time I
start a crawl on EC2, it hangs at some random URL while fetching; this is the
biggest roadblock. I'm unable to figure out what's going on: does my Nutch
configuration need to change, or is this caused by some EC2 trait? I've been
stuck at this point for quite a few days now and cannot proceed any further.
How do I go about figuring out what's wrong?

3. My earlier crawls were run with -topN 10000 and -topN 100000. I hoped to see
a huge number of links getting crawled, but 'readdb -stats' showed a mere
TOTAL urls: 8908, of which db_unfetched was 7848 and db_fetched was 680. So I
omitted -topN from the 'bin/nutch generate' command so that it allows limitless
crawling. Is this understanding correct?
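For concreteness, the incremental cycle I'm running looks roughly like this (a
sketch only; the paths, seed directory, and number of rounds are illustrative,
not my exact values):

```shell
#!/bin/sh
# Sketch of the incremental whole-web cycle using the individual Nutch 1.0
# commands. Paths, the seed directory, and the round count are illustrative.
bin/nutch inject crawl/crawldb seed              # seed the crawldb once
for round in 1 2 3; do                           # one iteration per crawl round
  bin/nutch generate crawl/crawldb crawl/segments  # no -topN: no per-round cap
  segment=$(ls -d crawl/segments/* | tail -1)      # newest generated segment
  bin/nutch fetch "$segment"                       # fetch the fetchlist
  bin/nutch updatedb crawl/crawldb "$segment"      # fold new outlinks back in
done
bin/nutch readdb crawl/crawldb -stats            # check fetched/unfetched counts
```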
Details of my configuration:

> cat seed/urls
http://answers.yahoo.com
http://answers.yahoo.com/dir/index

nutch-site.xml:
--------------
<property>
  <name>http.agent.name</name>
  <value>VidyaMN</value>
  <description>Vidya M N</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Nutch spiderman</value>
  <description>Nutch spiderman</description>
</property>
<property>
  <name>http.agent.url</name>
  <value>myurl</value>
  <description>crawl yahoo answers</description>
</property>
<property>
  <name>http.agent.email</name>
  <value>myemailid</value>
  <description>my email address</description>
</property>
<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

All the other configuration is left at the defaults.

crawl-urlfilter.txt:
-------------------
# skip URLs containing certain characters as probable queries, etc.
-...@]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*answers.yahoo.com/

regex-urlfilter.txt:
-------------------
# skip URLs containing certain characters as probable queries, etc.
-...@]
-http://answers.yahoo.com/my/*
-http://answers.yahoo.com/info/*

(Note that I've removed '?' and '=' from the skip pattern to allow crawling
links of the form http://answers.yahoo.com/question/index?qid=20091120014022AAsbXCf)

Tail of the most recent crawl.log:
---------------------------------
fetching http://answers.yahoo.com/answer/report?qid=20091119191349AAQC5Yz&kid=PJguAki7KW5GdTnxOQrw&.crumb=
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
fetching http://answers.yahoo.com/question/report?qid=20091119160525AArWFBa&kid=N7tuLk_KAFOWQFhDuLC5&date=2009-11-19+16%3A05%3A25&.crumb=&s=q

Tail of the most recent hadoop.log:
----------------------------------
2009-11-22 06:28:28,251 INFO fetcher.Fetcher - fetching http://answers.yahoo.com/question/report?qid=20091119160525AArWFBa&kid=N7tuLk_KAFOWQFhDuLC5&date=2009-11-19+16%3A05%3A25&.crumb=&s=q

Any help/inputs would be greatly appreciated; thanks in advance.

Thanks and Regards,
Vidya.

--
View this message in context: http://old.nabble.com/Nutch-whole-web-crawl-in-EC2-hangs-and-fetches-few-URLs-tp26463891p26463891.html
Sent from the Nutch - User mailing list archive at Nabble.com.
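P.S. One thing I'm planning to try the next time the fetch step hangs (my own
diagnostic sketch, not an official Nutch procedure): take a JVM thread dump to
see what the fetcher threads are blocked on. This assumes a JDK on the instance
so that jps and jstack are on the PATH; the output file name is arbitrary.

```shell
# Sketch: thread-dump the running fetch step while it appears hung.
# jps lists running JVMs; pick the one whose command line mentions nutch.
PID=$(jps -l 2>/dev/null | awk '/nutch/ {print $1; exit}')
if [ -z "$PID" ]; then
  echo "no nutch JVM found"
else
  # jstack prints every thread's stack trace; fetcher threads stuck in
  # socket reads or politeness waits should show up in the dump.
  jstack "$PID" > fetcher-threads.txt
  head fetcher-threads.txt
fi
```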