Hi,

On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Is there a good explanation someone can point me to as to why, when I set up
> a Hadoop cluster, my entire site isn't crawled? It doesn't make sense that I
> should have to tweak the number of Hadoop map and reduce tasks in order to
> ensure that everything gets indexed.
And you shouldn't. The number of map and reduce tasks may affect crawling speed, but it doesn't affect the number of crawled urls.
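(For the record, if you ever do want to tune those numbers, they are normally picked up from the Hadoop configuration; something along these lines in conf/hadoop-site.xml would do it. The values below are only examples, adjust them to your cluster.)

  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>    <!-- example value, scale to your cluster -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>    <!-- example value -->
  </property>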
> I followed the tutorial here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that only a
> small portion of my site was indexed. Besides explicitly stating every URL
> on the site, what should I do to ensure that my Hadoop cluster (of only 4
> machines) manages to create a full index?
Does it work on a single machine? If it does, then this is very weird. Here are a couple of things to try (example commands below):

* After injecting urls, do a readdb -stats to count the number of injected urls.
* After generating, do a readseg -list <segment> to count the number of generated urls.
* If the number of urls in your segment is correct, then during fetching check the number of successfully fetched urls in the web UI (perhaps the cluster machines can't fetch those urls?).
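For example, run from the Nutch install directory, the first two checks would look roughly like this (the crawldb and segment paths are just placeholders for your own):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readseg -list crawl/segments/20070525123456

The fetch counts show up in the Hadoop JobTracker web UI, http://<jobtracker-host>:50030/ unless you changed the default port.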
> Thanks for the help.
>
> Jeff
--
Doğacan Güney
