[ 
https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506775
 ] 

Andrzej Bialecki  commented on NUTCH-497:
-----------------------------------------

The patch looks good to me as it stands. However, I've seen the same problem 
with getTextHelper, and for that matter with any other DOM tree traversal 
present in Nutch (all the other places in DOMContentUtils, HTMLMetaTags, 
CCParseFilter and HtmlLanguageParser).

We can apply the patch as is, but it would be good to come up with a general 
method of stack-based DOM traversal, so that we can use it in other places, too.
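
To make the idea of a general stack-based traversal concrete, here is a minimal 
sketch. It is not the attached patch and not existing Nutch code: the class name 
IterativeDomWalker and the walk method are hypothetical, and it assumes a JDK 
with java.util.function.Consumer (Java 8 or later). It replaces the call stack 
with an explicit Deque, so nesting depth is limited by heap size rather than by 
thread stack size.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, for illustration only; not part of Nutch.
final class IterativeDomWalker {

  /** Visits root and all of its descendants in document order, without recursion. */
  static void walk(Node root, Consumer<Node> visitor) {
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      visitor.accept(node);
      // Push children last-to-first so that the first child is popped next.
      for (Node child = node.getLastChild(); child != null;
           child = child.getPreviousSibling()) {
        stack.push(child);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Document doc =
        DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    // Build a chain of 20000 nested <div> elements around one text node,
    // wrapping from the inside out.
    Node inner = doc.createTextNode("leaf");
    for (int i = 0; i < 20000; i++) {
      Node wrapper = doc.createElement("div");
      wrapper.appendChild(inner);
      inner = wrapper;
    }
    int[] visited = {0};
    walk(inner, n -> visited[0]++);
    System.out.println("visited " + visited[0] + " nodes");
  }
}
```

A caller such as the outlink or text extraction code could pass its per-node 
logic as the visitor. Code that needs to prune subtrees or track state on the 
way back up would need a richer interface than this sketch offers.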

> Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider 
> Trap
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-497
>                 URL: https://issues.apache.org/jira/browse/NUTCH-497
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8.1, 0.9.0, 1.0.0
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: ExtremeNestedTags.patch, nested-tags-trap.patch, 
> nested-tags-trap2.patch
>
>
> Some webpages contain a form of spider trap that causes a 
> StackOverflowException in DomContentUtils: tags nested thousands of layers 
> deep.  DomContentUtils uses a recursive method to parse the HTML when 
> extracting outlinks, and with this much nesting it errors out.


