[
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262912#comment-16262912
]
Cass Pallansch commented on NUTCH-2464:
---------------------------------------
That would be great! It turns out that the team I am helping with this issue
is using v1.13, not the v2.x line of Nutch. If this could also be corrected in
the latest v1.x, it would help solve a significant problem that we are trying
to overcome.
> Headers That Contain HTML Elements Are Not Parsed
> -------------------------------------------------
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
> Issue Type: Bug
> Components: plugin
> Affects Versions: 2.3
> Environment: Internal development/test environments.
> Reporter: Cass Pallansch
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained
> within header elements (e.g., H1, H2, H3 tags). These elements often contain
> anchors and/or <span> tags, and it is those nested tags that hold the actual
> text nodes that should be picked up as the header value for indexing.
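
To illustrate the behavior the report asks for, here is a minimal sketch in plain Java using only the JDK's DOM API. It is not Nutch code and makes no claim about how the headings plugin is implemented; the class name HeadingText and its methods are hypothetical, and the input is assumed to be well-formed XHTML so that the standard XML parser can read it. The idea is to descend through every child of the heading and collect all text nodes, rather than reading only the heading's direct text children.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
public class HeadingText {

    // Recursively append the value of every descendant text node.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Text of a heading, including text nested in <a>, <span>, etc.,
    // with runs of whitespace collapsed to a single space.
    static String headingText(Node heading) {
        StringBuilder sb = new StringBuilder();
        collectText(heading, sb);
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    // Assumption: the input is well-formed XHTML. Returns the text of the
    // first element with the given tag name, or "" if there is none.
    static String firstHeading(String xhtml, String tag) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList found = doc.getElementsByTagName(tag);
        return found.getLength() == 0 ? "" : headingText(found.item(0));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1><a href=\"/x\">"
            + "<span>Nested</span> title</a></h1></body></html>";
        System.out.println(firstHeading(page, "h1")); // prints "Nested title"
    }
}
```

A parser that reads only the direct text children of the h1 above would see nothing, because all of the text sits inside the anchor and the span. The recursive walk is what recovers it.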