Hello,
I am having memory problems while trying to crawl a local website. I give Nutch 1 GB of heap, but the crawl still cannot finish. To work around this, I want to keep the segment sizes limited.

The sizes of the generated segments vary. The first 3-5 levels are small, which is understandable: at that stage Nutch has not discovered many URLs yet. But as more pages are fetched and new URLs are discovered, the segments grow, and when they get too big Nutch runs out of memory.

I am thinking of using the topN parameter to keep segment sizes bounded (roughly as sketched after my signature). BUT, I don't want to lose any pages. The question is: if I limit the number of URLs fetched at a particular level, will I lose the URLs that were not selected for fetching at that level, or will their fetching just be postponed until there are no "better" candidates?

Any other recommendations on how to keep Nutch's memory use to a minimum?

Thanks,
Arkadi
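
P.S. For reference, this is roughly what I am running now and what I plan to try with topN. The seed directory, depth, thread count, and the topN value of 1000 are just placeholders, not my real settings, and I believe NUTCH_HEAPSIZE is the variable the bin/nutch script reads for the heap size (in MB):

    # heap for bin/nutch, in MB
    export NUTCH_HEAPSIZE=1000

    # current one-step crawl, no per-level cap
    bin/nutch crawl urls -dir crawl -depth 10 -threads 10

    # what I am considering: cap each generate/fetch cycle at 1000 URLs
    bin/nutch crawl urls -dir crawl -depth 10 -threads 10 -topN 1000

My understanding is that -topN only limits how many of the top-scoring URLs are selected per generate cycle; my question above is whether the URLs that are not selected stay in the crawldb and get picked up in later cycles.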
