djames wrote:
Hi all, I have a problem with the parser when I try to crawl 2000 sites with a depth of 3. I use Nutch version 0.81, and my setup worked well with other sites, but this list gave me this error:

2007-06-06 13:49:27,997 WARN mapred.LocalJobRunner - job_qsjobz
java.lang.StackOverflowError
    at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
    at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)

I've seen this on some occasions, but I haven't discovered the real cause of this error yet. For now, I suggest that you modify the source of DOMContentUtils to artificially limit the depth of recursion in getOutlinks to something like 200-300 levels.
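
To illustrate the idea, here is a minimal standalone sketch of a depth-limited recursive DOM walk. It is not the Nutch source: the class name DepthLimitedLinks, the methods collectLinks and parse, and the MAX_DEPTH constant are all hypothetical, and the real getOutlinks has a different signature and handles more link types. It only shows where a depth counter and a cutoff could go.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Nutch.
class DepthLimitedLinks {
    // Assumed cap, picked from the 200-300 range suggested above.
    static final int MAX_DEPTH = 250;

    // Collects href values of <a> elements, skipping any subtree
    // nested deeper than maxDepth so the recursion stays bounded.
    static void collectLinks(Node node, int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) {
            return; // stop descending; links below this level are dropped
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), depth + 1, maxDepth, out);
        }
    }

    // Helper for the demo: parses a well-formed XML string with the JDK parser.
    static Node parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Node root = parse("<html><body><a href='http://a.example/'>a</a>"
                + "<div><div><a href='http://b.example/'>b</a></div></div>"
                + "</body></html>");

        List<String> all = new ArrayList<>();
        collectLinks(root, 0, MAX_DEPTH, all);
        System.out.println(all);     // both links

        List<String> shallow = new ArrayList<>();
        collectLinks(root, 0, 2, shallow);
        System.out.println(shallow); // only the link at depth 2
    }
}
```

The trade-off is that outlinks nested below the cutoff are silently lost, so the limit should be high enough for ordinary pages and low enough to stay within the thread's stack.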

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
