John Mendenhall wrote:
Thank you in advance for any assistance you can
provide, or pointers to where I should look.

I am using nutch 0.9, with 1 master, 4 slaves.
I am crawling a single site with 1.4 million urls.

I am running the standard generate/fetch/updatedb cycle
with topN set to 100000.
All 97 map tasks appear to get assigned, but only one
task sees any activity.
That one task crawls about 3% of my topN and eventually
stops with java.lang.OutOfMemoryError: Java heap space
errors.

Are you running Fetcher in parsing mode? Try to use the -noParsing option, and then parse the content in a separate step.
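For reference, the separated fetch/parse steps might look like the following (a sketch assuming a Nutch 0.9 layout; the segment path is illustrative, not from the original thread):

```
# Fetch without parsing, so parse failures cannot take down the fetch job.
bin/nutch fetch crawl/segments/20080101000000 -noParsing

# Parse the fetched content as a separate job.
bin/nutch parse crawl/segments/20080101000000

# Then update the crawl db as usual.
bin/nutch updatedb crawl/crawldb crawl/segments/20080101000000
```

Running the parse separately also means a heap blowup during parsing only costs you the parse step, not the already-fetched content.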



I believe I have two problems.  One is the heap space
issue.  The other is that the generated urls are not
being spread out across multiple map task slots.

What settings do I need to modify to get the generated
topN (100000) urls to be spread out amongst all map
task slots?

What is the host distribution of your fetchlist? I.e. how many unique hosts do you have among all the URLs in the fetchlist? If it's just 1 (or few) it could happen that they are mapped to a single map task. This is done on purpose - there is no central lock manager in Nutch / Hadoop, and Nutch needs a way to control the rate of access to any single host, for politeness reasons. Nutch can do this only if all urls from the same host are assigned to the same map task.
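The assignment described above can be sketched roughly as follows. This is an illustrative partitioner in the spirit of Nutch's partition-by-host behavior, not its actual code; the class name and partition count are made up for the example:

```java
import java.net.URL;

// Sketch of host-based partitioning: every URL from the same host
// maps to the same partition (map task), so that one task can enforce
// per-host politeness delays without a central lock manager.
public class HostPartitioner {
    private final int numPartitions;

    public HostPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Hash only the host part of the URL, then reduce modulo the
    // number of partitions. A single-host crawl therefore lands
    // entirely in one partition, as observed in this thread.
    public int getPartition(String url) throws Exception {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws Exception {
        HostPartitioner p = new HostPartitioner(97);
        // Same host: same partition, regardless of path.
        System.out.println(p.getPartition("http://example.com/page1"));
        System.out.println(p.getPartition("http://example.com/page2"));
        // A different host may (or may not) hash elsewhere.
        System.out.println(p.getPartition("http://other.org/"));
    }
}
```

With 1.4 million URLs all on one site, this is why 96 of the 97 tasks sit idle: the parallelism of the fetch is bounded by the number of distinct hosts, not by topN.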

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
