Hi, I have a set of 100,000 URLs that I am trying to crawl and index. I have the heap size for child tasktrackers set to 512MB, and I have currently disabled PDF and DOC parsing. I am running this on Nutch-0.8 on a single RHEL node with depth set to 1.
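For reference, here is roughly how I have those two settings configured (a sketch only: mapred.child.java.opts and plugin.includes are the standard Hadoop/Nutch property names, but the exact values and the plugin list below are assumed and trimmed for brevity):

In conf/hadoop-site.xml (heap for the child task JVMs):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>

In conf/nutch-site.xml (plugin.includes with parse-pdf and parse-msword left out, so PDF and DOC documents are skipped by the parser):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>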
While running the parse job I get a "java.lang.OutOfMemoryError: Java heap space". The parse_data directory does not exist at any point during the job execution. Despite several re-runs, I hit the same error every time. When I re-ran the crawl with only 20,000 URLs, the entire thing completed fine. Is Nutch known to fail with large sets of URLs? Is there a patch available, or am I missing something?

Thanks,
Manav
