Lewis John McGibbney created NUTCH-1908:
-------------------------------------------

             Summary: HTMLMetaProcessor should be able to recognise and retrieve metatags from <body>
                 Key: NUTCH-1908
                 URL: https://issues.apache.org/jira/browse/NUTCH-1908
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.9, 2.3
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 2.4, 1.10


I regularly come across HTML authors who publish <meta> tags within the
<body> of an (X)HTML document.
Right now, the Nutch policy appears to be to ignore such markup.
Evidence of this exists within HTMLMetaProcessor (0), as follows:

{code}
  private static final void getMetaTagsHelper(HTMLMetaTags metaTags, Node node,
      URL currURL) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      if ("body".equalsIgnoreCase(node.getNodeName())) {
        // META tags should not be under body
        return;
      }
...
{code}

In a utopian WWW it would be OK to state that 'META tags should not be under
body'; however, I am afraid that this is not always the case in practice. An
improvement in Nutch would therefore be for us to recognize that HTML authors,
or machines, do put <meta> tags into the <body>.

Over in Any23 and in crawler commons, we have taken the approach that 'letting
people off' with having sh*tty markup is OK. I think this also makes sense in
Nutch.

I will implement a patch which permits explicit extraction of <meta> tags from 
<body> as well as <head>.
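
To illustrate the idea only (this is not the patch itself), here is a minimal
sketch of a DOM walk that does not stop at <body>. The class MetaTagSketch and
its methods collectMeta and collect are hypothetical names, it uses the JDK's
own XML parser on well-formed XHTML rather than the Nutch parse plugins, and
the "first occurrence wins" rule for duplicate names is an assumption, not
agreed Nutch behaviour.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
class MetaTagSketch {

  // Recursively collects name/content pairs from every <meta> element,
  // regardless of whether it sits under <head> or <body>.
  static void collectMeta(Node node, Map<String, String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "meta".equalsIgnoreCase(node.getNodeName())) {
      Element meta = (Element) node;
      String name = meta.getAttribute("name"); // "" when the attribute is absent
      if (!name.isEmpty()) {
        // Assumption: the first occurrence in document order wins, so a
        // <head> value is not overwritten by a later <body> value.
        out.putIfAbsent(name.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
      }
    }
    // No early return on <body>: descend into every child.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectMeta(children.item(i), out);
    }
  }

  // Parses a well-formed XHTML string and returns the collected meta tags.
  static Map<String, String> collect(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    Map<String, String> out = new LinkedHashMap<>();
    collectMeta(doc, out);
    return out;
  }

  public static void main(String[] args) throws Exception {
    String xhtml =
        "<html><head><meta name=\"description\" content=\"from head\"/></head>"
        + "<body><p>text</p><meta name=\"keywords\" content=\"from body\"/></body></html>";
    System.out.println(collect(xhtml));
  }
}
```

The real HTMLMetaProcessor also interprets robots and http-equiv directives;
whether those should be honoured when found in <body> is a separate decision
that this sketch does not address.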

(0) 
https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java#L56



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
