djames wrote:
Hi all,

I have a problem with the parser when I try to crawl 2000 sites with a depth
of 3.
I use Nutch version 0.81, and my setup worked well with other sites, but this
list gave me this error:

2007-06-06 13:49:27,997 WARN  mapred.LocalJobRunner - job_qsjobz
java.lang.StackOverflowError
        at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
        at org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)

I've seen this on some occasions, but I haven't discovered the real reason for this error yet - for now I suggest that you modify the source of DOMContentUtils to artificially limit the level of recursion in getOutlinks to something like 200-300.
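
To illustrate the kind of change I mean, here is a minimal, self-contained sketch of a depth-limited recursive walk over a DOM tree. It is not the actual DOMContentUtils code: the class name DepthLimitedLinks, the methods collectLinks and parse, and the MAX_DEPTH value of 250 are all made up for this example, and it uses only the standard JDK XML parser and org.w3c.dom interfaces. The idea is simply to pass the current depth down the recursion and stop descending once it exceeds the cap.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DepthLimitedLinks {
    // Hypothetical cap, picked from the 200-300 range suggested above.
    static final int MAX_DEPTH = 250;

    // Recursively collects href values of <a> elements, but refuses to
    // descend below maxDepth so a very deep tree cannot exhaust the stack.
    static void collectLinks(Node node, int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) {
            return; // too deep: skip this subtree instead of recursing further
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectLinks(children.item(i), depth + 1, maxDepth, out);
        }
    }

    // Helper for the demo only: parses a well-formed markup string into a DOM.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><a href='http://a.example/'>a</a>"
                + "<div><div><a href='http://b.example/'>b</a></div></div>"
                + "</body></html>");
        List<String> links = new ArrayList<>();
        collectLinks(doc.getDocumentElement(), 0, MAX_DEPTH, links);
        System.out.println(links);
    }
}
```

The trade-off is that links nested deeper than the cap are silently dropped, which should be acceptable as a workaround since such pages are usually malformed anyway.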


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
