[ https://issues.apache.org/jira/browse/NUTCH-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes closed NUTCH-555.
------------------------------

    Assignee: Dennis Kubes

Issue closed, fixed by NUTCH-497

> StackOverflowError in DomContentUtils
> -------------------------------------
>
>                 Key: NUTCH-555
>                 URL: https://issues.apache.org/jira/browse/NUTCH-555
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Karsten Dello
>            Assignee: Dennis Kubes
>         Attachments: readseg.txt, stacktrace.txt
>
>
> Parsing certain pages (which contain very bad HTML) causes a
> StackOverflowError, because the recursion depth becomes too high
> (more than 1000).
> Parsing should be stable, though; it is probably better to just skip pages
> like this.
> Attached are:
> a) the stack trace
> b) the segmentreader-get output for the URL where the exception is thrown
> Possible fixes:
> parseOutlinks in DomContentUtils is implemented recursively.
> An iterative implementation would fix this, but it may be easier to simply
> limit the recursion to a reasonable depth.
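
For illustration, the iterative option mentioned under "Possible fixes" could look roughly like the sketch below. This is not the Nutch code and not the change made in NUTCH-497; the class name IterativeLinkWalker and the method collectHrefs are hypothetical, and the sketch assumes only a standard org.w3c.dom tree. It replaces the call stack with an explicit stack on the heap, so the nesting depth of the page no longer determines how deep the JVM stack grows.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper for illustration only; not part of Nutch. */
public class IterativeLinkWalker {

    /**
     * Collects the href values of all <a> elements below root, in document
     * order, using an explicit stack instead of recursion.
     */
    public static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);

        while (!stack.isEmpty()) {
            Node node = stack.pop();

            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                // getAttribute returns an empty string when href is absent.
                String href = ((Element) node).getAttribute("href");
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }

            // Push children last-to-first so the first child is visited next,
            // which preserves document (pre-order) traversal.
            for (Node child = node.getLastChild(); child != null;
                    child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
        return hrefs;
    }
}
```

The other option from the report, capping the recursion depth, would be a smaller change but silently drops links nested below the cap, whereas the explicit stack handles arbitrarily deep markup at the cost of heap memory proportional to the number of pending nodes.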