On Fri, 25 Jan 2008, Andrzej Bialecki wrote:

> >I am using nutch 0.9, with 1 master, 4 slaves.
> >I am crawling a single site with 1.4 million urls.
> >
> >I am running the std generate/fetch/updatedb cycle
> >with topN at 100000.
> >It appears all 97 tasks get mapped. Only one task
> >sees any action.
> >The one task crawls about 3% of my topN and stops
> >eventually with java.lang.OutOfMemoryError: Java heap space
> >errors.
>
> Are you running Fetcher in parsing mode? Try to use the -noParsing
> option, and then parse the content in a separate step.

I am running the fetcher in parsing mode. Is this possibly taking up
too much memory? Is that the most likely cause of the problem? Is the
recommendation to run the fetcher without parsing? If so, when should
the parse step be done? After the updatedb? Before the indexing?
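Just to make sure I follow the suggestion, the cycle I would try looks
something like the sketch below. The crawl/ paths are only placeholders
for my actual directories, and the ordering is my guess:

  # generate the fetchlist as before
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000

  # pick up the segment that was just created
  SEGMENT=`ls -d crawl/segments/* | tail -1`

  # fetch without parsing the content
  bin/nutch fetch $SEGMENT -noParsing

  # parse the fetched content as a separate step
  bin/nutch parse $SEGMENT

  # then update the crawldb from the parsed segment
  bin/nutch updatedb crawl/crawldb $SEGMENT

Is that the right order, i.e. parse before the updatedb and before any
indexing?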
> >What settings do I need to modify to get the generated
> >topN (100000) urls to be spread out amongst all map
> >task slots?
>
> What is the host distribution of your fetchlist? I.e. how many unique
> hosts do you have among all the URLs in the fetchlist? If it's just 1
> (or few) it could happen that they are mapped to a single map task. This
> is done on purpose - there is no central lock manager in Nutch / Hadoop,
> and Nutch needs a way to control the rate of access to any single
> host, for politeness reasons. Nutch can do this only if all urls from
> the same host are assigned to the same map task.

All hosts are the same. Every one of them. If there is no way to split
them up, this seems to imply that the distributed nature of Nutch is
lost when attempting to build an index for a single large site. Please
correct me if I am wrong in this presumption.
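For my own understanding, the partitioning described above would amount
to something like the rough sketch below. This is only my illustration
against the plain Hadoop mapred API, not the actual Nutch code, and the
key/value types are my assumption:

  // Illustration only: send every url from the same host to the same
  // partition, so politeness can be enforced per host without a
  // central lock manager.
  import java.net.MalformedURLException;
  import java.net.URL;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class PartitionByHost implements Partitioner<Text, Writable> {

    public void configure(JobConf job) {
    }

    public int getPartition(Text urlKey, Writable value, int numPartitions) {
      String host;
      try {
        host = new URL(urlKey.toString()).getHost();
      } catch (MalformedURLException e) {
        host = urlKey.toString();
      }
      // One host -> one hash -> one partition, no matter how many urls.
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

If that is effectively what happens, then with every one of my 1.4
million urls on the same host, the whole topN of 100000 lands in a
single task, which matches what I am seeing.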
Thanks!

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services