[
https://issues.apache.org/jira/browse/NUTCH-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272406#comment-14272406
]
Lewis John McGibbney commented on NUTCH-1908:
---------------------------------------------
This issue is well described in the following Stack Overflow question:
http://stackoverflow.com/questions/1447842/what-happens-if-the-meta-tags-are-present-in-the-document-body
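
To illustrate the behaviour requested in the issue quoted below, here is a minimal sketch of a DOM walk that does not return early when it reaches <body>, so <meta> tags are collected wherever they appear. It is not the patch itself: the class and method names (MetaTagSketch, collectMetaTags, parse) are hypothetical, it uses the JDK's XML parser on well-formed XHTML rather than Nutch's HTMLMetaTags and parser plugins, and letting the first occurrence of a name win (so <head> takes precedence over <body>) is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of Nutch.
class MetaTagSketch {

  /** Recursively collects name/content pairs from every <meta> element. */
  static void collectMetaTags(Node node, Map<String, String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "meta".equalsIgnoreCase(node.getNodeName())) {
      Element meta = (Element) node;
      String name = meta.getAttribute("name"); // "" when the attribute is absent
      if (!name.isEmpty()) {
        // Assumption: the first occurrence in document order wins,
        // so a value from <head> is not overwritten by one from <body>.
        out.putIfAbsent(name.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
      }
    }
    // No early return on "body": descend into every child node.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectMetaTags(children.item(i), out);
    }
  }

  /** Parses a well-formed XHTML string and returns its meta tags in document order. */
  static Map<String, String> parse(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    Map<String, String> tags = new LinkedHashMap<>();
    collectMetaTags(doc, tags);
    return tags;
  }

  public static void main(String[] args) throws Exception {
    String xhtml =
        "<html><head><meta name=\"robots\" content=\"noindex\"/></head>"
            + "<body><meta name=\"description\" content=\"in body\"/></body></html>";
    System.out.println(parse(xhtml));
  }
}
```

With the sample input in main, both the <head> tag (robots) and the <body> tag (description) end up in the returned map.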
> HTMLMetaProcessor should be able to recognise and retrieve metatags from
> <body>
> -------------------------------------------------------------------------------
>
> Key: NUTCH-1908
> URL: https://issues.apache.org/jira/browse/NUTCH-1908
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.3, 1.9
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 2.4, 1.10
>
>
> I regularly encounter (X)HTML documents whose authors publish <meta> tags
> within the <body>.
> Right now, the Nutch policy appears to be to ignore such markup.
> Evidence of this exists in HTMLMetaProcessor (0), as follows:
> {code}
> private static final void getMetaTagsHelper(HTMLMetaTags metaTags,
>     Node node, URL currURL) {
>   if (node.getNodeType() == Node.ELEMENT_NODE) {
>     if ("body".equalsIgnoreCase(node.getNodeName())) {
>       // META tags should not be under body
>       return;
>     }
>     ...
> {code}
> In a utopian WWW it would be fine to state that 'META tags should not be
> under body'; however, the WWW is not utopian, and this is not always the
> case. An improvement in Nutch would therefore be to recognize that HTML
> authors, or machines, do put <meta> tags into the <body>.
> Over in Any23 and in crawler commons, we have taken the approach that
> 'letting people off' with sh*tty markup is OK. I think that approach also
> makes sense in Nutch.
> I will implement a patch which permits explicit extraction of <meta> tags
> from <body> as well as <head>.
> (0)
> https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java#L56