Hello,

I'm currently using Nutch to crawl a small selection of websites. All together there are about 60,000 HTML pages and about 10,000 PDFs, all of which should go into the index. I'm using a custom crawl class that repeats the crawl iterations not for a fixed number of rounds (depth), but until there are no more unfetched pages (quite similar to a Python script posted here some weeks ago). This is fine, since most of these pages don't change very often, so I can afford one long initial crawl, and because of URL filtering etc. I know the crawl will end at a definite number of pages.
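In case it clarifies things: my loop is essentially the stock Crawl class, with the fixed-depth for loop replaced by a while loop that stops once the Generator produces no new segment. Roughly like this (variable names simplified; indexing and error handling omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.Generator;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.fetcher.Fetcher;
    import org.apache.nutch.parse.ParseSegment;
    import org.apache.nutch.util.NutchConfiguration;
    import org.apache.nutch.util.NutchJob;

    Configuration conf = NutchConfiguration.create();
    JobConf job = new NutchJob(conf);

    Path dir = new Path("crawl");
    Path crawlDb = new Path(dir, "crawldb");
    Path segments = new Path(dir, "segments");
    int threads = 10;
    long topN = Long.MAX_VALUE;

    // all tools are constructed once, outside the loop,
    // and all get the same Configuration
    Generator generator = new Generator(conf);
    Fetcher fetcher = new Fetcher(conf);
    ParseSegment parseSegment = new ParseSegment(conf);
    CrawlDb crawlDbTool = new CrawlDb(conf);

    new Injector(conf).inject(crawlDb, new Path("urls"));

    while (true) {
      // generate() returns null when there is nothing left to fetch
      Path segment = generator.generate(crawlDb, segments, -1, topN,
          System.currentTimeMillis());
      if (segment == null) break;
      fetcher.fetch(segment, threads);
      if (!Fetcher.isParsing(job)) {
        parseSegment.parse(segment);
      }
      crawlDbTool.update(crawlDb, new Path[] { segment }, true, true);
    }

So each tool instance lives for the whole crawl and already gets the shared Configuration passed into its constructor.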
When I run this over the weekend :) I always get an OutOfMemoryError ("PermGen space") after about 400 iterations, during which about 40,000 pages get indexed. Increasing the JVM's PermGen space only delayed the error. So I ran my crawl with just a couple of URLs, so that it finishes after about 30 minutes, and did some profiling with JDK 6's jconsole, jmap and jhat.

In jconsole I can see that after each iteration PermGen usage grows by about 1 MB as more classes are loaded. Comparing a jmap dump taken after about 5 iterations with another taken after about 30 iterations, I find the following:

* PermGen usage has almost doubled.
* None of my custom plugins have more than 1 instance.
* Among the Nutch classes, org.apache.nutch.plugin.Extension and org.apache.nutch.plugin.PluginDescriptor have both gone from about 500 to about 3000 instances, and org.apache.nutch.plugin.ExtensionPoint from 1500 to 1000 instances.
* Looking further, org.apache.nutch.plugin.PluginRepository and org.apache.hadoop.mapred.JobConf have both increased from 16 to 99 instances.

Now I'm wondering whether that behavior is intended. It seems 3 or 4 JobConf instances are created during each loop iteration (in Generator.generate, Fetcher.fetch, ParseSegment.parse and CrawlDb.update). Is that really necessary? I already pass my JobConf in the constructor.

What could I do to run the crawl to completion? Increase PermGen even more, or is there something else?

Thanks for reading,
Rüdiger
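P.S. In case anyone wants to reproduce the measurements: I raised PermGen via -XX:MaxPermSize (e.g. -XX:MaxPermSize=512m), and the heap dumps were taken roughly like this, with <pid> being the crawler's process id and the dump file name just an example:

    jmap -dump:format=b,file=iter05.hprof <pid>
    jhat iter05.hprof

jhat then serves the instance counts on http://localhost:7000/.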
