> John Mendenhall wrote:
>> On Fri, 25 Jan 2008, Andrzej Bialecki wrote:
>>>> I am using nutch 0.9, with 1 master, 4 slaves.
>>>> I am crawling a single site with 1.4 million urls.
>>>> I am running the standard generate/fetch/updatedb cycle
>>>> with topN at 100000.
>>>> It appears all 97 map tasks get created, but only one
>>>> task sees any action. That one task crawls about 3% of
>>>> my topN and eventually stops with
>>>> java.lang.OutOfMemoryError: Java heap space errors.
>>> Are you running Fetcher in parsing mode? Try the -noParsing
>>> option, and then parse the content in a separate step.
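
For reference, that fetch-then-parse sequence would look roughly like
this with Nutch 0.9's bin/nutch script; the segment path here is
illustrative:

    # fetch only, without parsing content in the same pass
    bin/nutch fetch crawl/segments/20080125120000 -noParsing
    # parse the fetched content as a separate job
    bin/nutch parse crawl/segments/20080125120000
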
>> I am running fetcher in parsing mode.
>> Is this possibly taking up too much memory?
>> Is that most likely the problem?
> Yes, that is most likely the problem.
>> Is the recommendation to run fetcher in non-parsing mode?
>> If so, when should the parse be done? After the updatedb?
>> Before the indexing?
> You would run the parsing after the fetch process. This way the
> fetch completes the download, and if the parsing fails you still
> have the page content and can try again without refetching.
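
A sketch of where the parse step sits in the cycle, assuming the
usual crawl/crawldb and crawl/segments layout (paths illustrative):

    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    s=`ls -d crawl/segments/2* | tail -1`  # segment just created by generate
    bin/nutch fetch $s -noParsing          # download only
    bin/nutch parse $s                     # parse as a separate job
    bin/nutch updatedb crawl/crawldb $s    # updatedb reads the parse output

That is, parse after the fetch and before the updatedb (which picks up
the outlinks found during parsing); indexing comes after the updatedb.
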
>> What settings do I need to modify to get the generated
>> topN (100000) urls spread out amongst all the map
>> task slots?
> What is the host distribution of your fetchlist? I.e., how many
> unique hosts do you have among all the URLs in the fetchlist? If
> it's just one (or a few), they may all be mapped to a single map
> task. This is done on purpose - there is no central lock manager
> in Nutch / Hadoop, and Nutch needs a way to control the rate of
> access to any single host, for politeness reasons. Nutch can do
> this only if all urls from the same host are assigned to the same
> map task.
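
One rough way to check the host distribution is to dump the crawldb
and count hostnames; a sketch, assuming http URLs so that the third
/-separated field of each URL line is the host:

    bin/nutch readdb crawl/crawldb -dump dbdump
    # URL lines in the dump start with the url itself
    cat dbdump/part-* | awk -F/ '/^http/ {print $3}' | sort | uniq -c | sort -rn
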
All hosts are the same. Every one of them.
If there is no way to split them up, this seems to
imply the distributed nature of nutch is lost when
attempting to build an index for a single large
site. Please correct me if I am wrong in this
presumption.
Thanks!
JohnM