Hi,

I want to use Nutch in EC2 to crawl around 100 million URLs, extracting only
questions and answers from http://answers.yahoo.com. I'm a Nutch newbie, so
apologies for any basic queries. I have the following questions:
1. I chose the individual generate, fetch, updatedb, etc. commands over the
single one-step crawl command, because the former supports incremental
crawling and would let me crawl the entire http://answers.yahoo.com site. Is
my understanding correct?
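For reference, the per-round cycle I mean is roughly the following (a sketch against the Nutch 1.0 command-line tools; the paths under crawl/ are just my layout on the EBS volume, not anything standard):

```shell
# One incremental crawl round using the individual Nutch 1.0 commands.
# Paths under crawl/ are my own layout, not Nutch defaults.
bin/nutch inject crawl/crawldb seed              # seed the crawldb (first round only)
bin/nutch generate crawl/crawldb crawl/segments  # select URLs due for fetching
segment=$(ls -d crawl/segments/* | tail -1)      # newest segment directory
bin/nutch fetch "$segment"                       # fetch the selected pages
bin/nutch updatedb crawl/crawldb "$segment"      # merge fetch results into the crawldb
bin/nutch readdb crawl/crawldb -stats            # check progress
```

Repeating the generate/fetch/updatedb steps is what I understand by incremental crawling.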
2. For now, I've set up a crawl on a single large Ubuntu instance with a
10 GB EBS volume attached, using the Nutch 1.0 release binaries, to get an
understanding of Nutch before I delve into Hadoop to scale out. The Nutch
binaries and the /crawl folder are on the EBS volume. However, every crawl I
run in EC2 hangs at some random URL while fetching; this is the biggest
roadblock. I'm unable to figure out what's going on. Should my Nutch
configuration change, or is this caused by some EC2 trait? I've been stuck at
this point for quite some days now and cannot proceed any further. How do I
go about figuring out what's wrong?
3. My earlier crawls used -topN 10000 and -topN 100000. I hoped to see a
huge number of links getting crawled, but the readdb stats showed a mere
TOTAL urls: 8908, of which 7848 were db_unfetched and 680 were db_fetched.
So I omitted -topN from the 'bin/nutch generate' command so that the crawl
is not limited. Is this understanding correct?

Details of my configuration:
> cat seed/urls
http://answers.yahoo.com
http://answers.yahoo.com/dir/index

nutch-site.xml:
--------------
<property>
 <name>http.agent.name</name>
  <value>VidyaMN</value>
  <description>Vidya M N
  </description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Nutch spiderman</value>
  <description> Nutch spiderman
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value>myurl</value>
  <description>
crawl yahoo answers
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>myemailid</value>
  <description>my email address
  </description>
</property>
<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

All the other configurations are the default.

crawl-urlfilter.txt:
-------------------
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept hosts in answers.yahoo.com
+^http://([a-z0-9]*\.)*answers\.yahoo\.com/

regex-urlfilter.txt:
-------------------
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
-^http://answers\.yahoo\.com/my/
-^http://answers\.yahoo\.com/info/

(Note that I've removed '?' and '=' from the skipped-characters class so that
links of the form 'http://answers.yahoo.com/question/index?qid=20091120014022AAsbXCf'
can be crawled.)
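As a quick sanity check outside Nutch, the accept pattern does match a question URL of that form (this uses grep -E, which is close enough to Nutch's Java regex engine for this simple pattern):

```shell
# Check that the accept pattern matches a sample question URL.
# grep -E semantics approximate Nutch's Java regex for this simple case.
echo 'http://answers.yahoo.com/question/index?qid=20091120014022AAsbXCf' \
  | grep -E '^http://([a-z0-9]*\.)*answers\.yahoo\.com/' \
  && echo 'URL accepted'
```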

Tail of the most recent crawl.log:
-------------------------------
fetching
http://answers.yahoo.com/answer/report?qid=20091119191349AAQC5Yz&kid=PJguAki7KW5GdTnxOQrw&.crumb=
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=500
fetching
http://answers.yahoo.com/question/report?qid=20091119160525AArWFBa&kid=N7tuLk_KAFOWQFhDuLC5&date=2009-11-19+16%3A05%3A25&.crumb=&s=q

Tail of the most recent hadoop.log:
---------------------------------
2009-11-22 06:28:28,251 INFO  fetcher.Fetcher - fetching 
http://answers.yahoo.com/question/report?qid=20091119160525AArWFBa&kid=N7tuLk_KAFOWQFhDuLC5&date=2009-11-19+16%3A05%3A25&.crumb=&s=q

Any help/inputs would be greatly appreciated, thanks in advance.

Thanks and Regards,
Vidya.

-- 
View this message in context: 
http://old.nabble.com/Nutch-whole-web-crawl-in-EC2-hangs-and-fetches-few-URLs-tp26463891p26463891.html
Sent from the Nutch - User mailing list archive at Nabble.com.
