Hello,

I am having memory problems while trying to crawl a local website. I give Nutch 
1 GB of heap, but it still cannot finish the crawl. To get around this, I want 
to try to keep the segment sizes limited.
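
For reference, this is roughly how I am giving it the 1 GB; I believe the 
stock bin/nutch script reads NUTCH_HEAPSIZE in MB, though the exact variable 
and defaults may differ by version, and the seed/output dirs and depth below 
are just placeholders from my setup:

    # assuming a stock bin/nutch that honours NUTCH_HEAPSIZE (value in MB);
    # "urls", "crawl" and the depth are placeholders
    export NUTCH_HEAPSIZE=1000
    bin/nutch crawl urls -dir crawl -depth 10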

The sizes of the produced segments vary. Segments from the first 3-5 levels 
are small, which is understandable: at that stage Nutch has not discovered 
many URLs yet. But as more pages are fetched and new URLs are discovered, the 
segments grow, and once they get too big, Nutch runs out of memory.

I am thinking of using the topN parameter to keep segment sizes limited, BUT I 
don't want to lose any pages. The question is: if I limit the number of URLs 
fetched at a particular level, will I lose the URLs that were not selected for 
fetching at that level, or will their fetching just be postponed until there 
are no "better" candidates?
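
To be concrete, I mean capping each generate round with -topN, along these 
lines (the paths and the 1000 are just placeholder values from my setup):

    # placeholder crawldb/segments paths; cap each new segment at ~1000 URLs
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000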

Any other recommendations on how to keep Nutch memory use to a minimum?


Thanks,

Arkadi
 
