Hi,

On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Is there a good explanation someone can point me to as to why, when I set up
> a Hadoop cluster, my entire site isn't crawled? It doesn't make sense that I
> should have to tweak the number of Hadoop map and reduce tasks in order to
> ensure that everything gets indexed.
And you shouldn't. The number of map and reduce tasks may affect crawling speed, but it doesn't affect the number of crawled urls.
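(For the record, if you ever do want to tune those numbers, they are normally picked up from the Hadoop configuration; something along these lines in conf/hadoop-site.xml would do it. The values below are only examples, adjust them to your cluster.)

  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>    <!-- example value, scale to your cluster -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>    <!-- example value -->
  </property>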
> I followed the tutorial here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that only a
> small portion of my site was indexed. Besides explicitly stating every URL
> on the site, what should I do to ensure that my Hadoop cluster (of only 4
> machines) manages to create a full index?
Does it work on a single machine? If it does, then this is very weird. Here are a couple of things to try (example commands below):

* After injecting urls, do a readdb -stats to count the number of injected urls.
* After generating, do a readseg -list <segment> to count the number of generated urls.
* If the number of urls in your segment is correct, then during fetching check the number of successfully fetched urls in the web UI (perhaps the cluster machines can't fetch those urls?).
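For example, run from the Nutch install directory, the first two checks would look roughly like this (the crawldb and segment paths are just placeholders for your own):

  bin/nutch readdb crawl/crawldb -stats
  bin/nutch readseg -list crawl/segments/20070525123456

The fetch counts show up in the Hadoop JobTracker web UI, http://<jobtracker-host>:50030/ unless you changed the default port.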
> Thanks for the help.
>
> Jeff
--
Doğacan Güney
