[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775 ]
Andrzej Bialecki commented on NUTCH-497:
-----------------------------------------

The patch looks good to me as it is now. However, I've seen similar issues with getTextHelper, too, or for that matter with any other DOM tree traversal present in Nutch (all other places in DOMContentUtils, HTMLMetaTags, CCParseFilter and HtmlLanguageParser). We can apply the patch as is, but it would be good to come up with a general method of stack-based DOM traversal, so that we can use it in other places, too.

> Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider
> Trap
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-497
>                 URL: https://issues.apache.org/jira/browse/NUTCH-497
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0, 1.0.0
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: ExtremeNestedTags.patch, nested-tags-trap.patch,
> nested-tags-trap2.patch
>
>
> Some webpages have a form of a spider trap that causes a
> StackOverflowException in DomContentUtils by having nested tags with
> thousands of layers deep. DomContentUtils when trying to get outlinks uses a
> recursive method to parse the html. With this type of nesting it errors out.

-- 
This message is automatically generated by JIRA.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
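
For illustration, the general stack-based DOM traversal suggested in the comment could look roughly like the sketch below. This is not the code from any of the attached patches. The class and method names (IterativeDomWalker, walk, collectHrefs, buildNested) are hypothetical, and the sketch assumes only the standard org.w3c.dom API. It replaces recursion with an explicit stack, so nesting depth is limited by heap memory rather than by the thread's call stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch or of the attached patches.
public class IterativeDomWalker {

    /** Visits root and all of its descendants in document order, without recursion. */
    public static void walk(Node root, Consumer<Node> visitor) {
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            visitor.accept(node);
            // Push children last-to-first so that the first child is popped next.
            for (Node c = node.getLastChild(); c != null; c = c.getPreviousSibling()) {
                stack.push(c);
            }
        }
    }

    /** Example visitor: collects the href values of all <a> elements. */
    public static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        walk(root, node -> {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                String href = ((Element) node).getAttribute("href");
                if (!href.isEmpty()) {
                    hrefs.add(href);
                }
            }
        });
        return hrefs;
    }

    /** Builds a test document: 'depth' nested <div> elements around a single <a>. */
    public static Document buildNested(int depth) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element current = doc.createElement("a");
        current.setAttribute("href", "http://example.com/");
        // Wrap from the inside out, so each append targets a node with no parent yet.
        for (int i = 0; i < depth; i++) {
            Element wrapper = doc.createElement("div");
            wrapper.appendChild(current);
            current = wrapper;
        }
        doc.appendChild(current);
        return doc;
    }

    public static void main(String[] args) {
        // Thousands of nesting levels, similar in shape to the reported spider trap.
        Document doc = buildNested(20_000);
        System.out.println(collectHrefs(doc));
    }
}
```

Passing the per-node work in as a visitor is one way to let the same walker serve several call sites. Whether that fits the places named in the comment (DOMContentUtils, HTMLMetaTags, CCParseFilter, HtmlLanguageParser) would have to be checked against the actual code, since some of them may need to skip subtrees or track state on the way back up.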