[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dennis Kubes updated NUTCH-497:
-------------------------------

    Attachment: nested-tags-trap3.patch

Adds a utility class called NodeWalker, which provides a generic framework for stack-based walking of Node trees. This framework is then applied to DomContentUtils and HtmlLanguageParser, reworking functionality that used to be handled by recursion. The patch file is nested-tags-trap3.patch.

> Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-497
>                 URL: https://issues.apache.org/jira/browse/NUTCH-497
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0, 1.0.0
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: ExtremeNestedTags.patch, nested-tags-trap.patch, nested-tags-trap2.patch, nested-tags-trap3.patch
>
>
> Some webpages contain a form of spider trap that causes a StackOverflowException in DomContentUtils by nesting tags thousands of layers deep. When trying to get outlinks, DomContentUtils uses a recursive method to parse the HTML. With this type of nesting, it errors out.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
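
For readers who want to see the stack-based walking described in the update, here is a minimal sketch of the general technique. It is not the code in nested-tags-trap3.patch. The class StackWalkSketch, its nested Walker class, and the helper methods are hypothetical names chosen for illustration. The sketch assumes only the standard org.w3c.dom API and keeps the pending nodes in an explicit stack, so nesting depth consumes heap rather than call-stack frames.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical illustration only; not the NodeWalker class from the patch. */
public class StackWalkSketch {

  /** Pre-order walker over a DOM subtree, backed by an explicit stack. */
  public static final class Walker {
    private final Deque<Node> pending = new ArrayDeque<>();

    public Walker(Node root) {
      pending.push(root);
    }

    public boolean hasNext() {
      return !pending.isEmpty();
    }

    /** Returns the next node in document order and queues its children. */
    public Node nextNode() {
      Node current = pending.pop();
      NodeList children = current.getChildNodes();
      // Push children in reverse so that the first child is popped first.
      for (int i = children.getLength() - 1; i >= 0; i--) {
        pending.push(children.item(i));
      }
      return current;
    }
  }

  /** Counts element nodes under root (inclusive) without recursion. */
  public static int countElements(Node root) {
    int count = 0;
    Walker walker = new Walker(root);
    while (walker.hasNext()) {
      if (walker.nextNode().getNodeType() == Node.ELEMENT_NODE) {
        count++;
      }
    }
    return count;
  }

  /** Builds a chain of nested div elements, innermost first, using a loop. */
  public static Element buildNestedChain(int depth) throws ParserConfigurationException {
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element top = doc.createElement("div");
    for (int i = 1; i < depth; i++) {
      Element wrapper = doc.createElement("div");
      wrapper.appendChild(top);
      top = wrapper;
    }
    return top;
  }

  public static void main(String[] args) throws Exception {
    // Nesting this deep is the kind of input a recursive walk may overflow on.
    Element root = buildNestedChain(50_000);
    System.out.println("elements visited: " + countElements(root));
  }
}
```

A caller such as an outlink extractor would replace its recursive descent with the while loop shown in countElements, inspecting each node as nextNode() returns it.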