Lewis John McGibbney created NUTCH-1908:
-------------------------------------------
Summary: HTMLMetaProcessor should be able to recognise and
retrieve metatags from <body>
Key: NUTCH-1908
URL: https://issues.apache.org/jira/browse/NUTCH-1908
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 1.9, 2.3
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.4, 1.10
I regularly encounter (X)HTML documents whose authors publish <meta> tags
within the <body> of the document.
Right now, the Nutch policy appears to be to ignore such markup.
Evidence of this exists within HTMLMetaProcessor (0) as follows:
{code}
private static final void getMetaTagsHelper(HTMLMetaTags metaTags, Node node,
    URL currURL) {
  if (node.getNodeType() == Node.ELEMENT_NODE) {
    if ("body".equalsIgnoreCase(node.getNodeName())) {
      // META tags should not be under body
      return;
    }
    ...
{code}
In a utopian WWW it would be fine to state that 'META tags should not be under
body', but the WWW is not utopian and this rule is not always followed. An
improvement in Nutch would therefore be for us to recognize that HTML authors,
or machines, do put <meta> tags into the <body>.
Over in Any23 and in crawler commons, we have taken the approach that 'letting
people off' with sh*tty markup is OK. I think this also makes sense in Nutch.
I will implement a patch that permits explicit extraction of <meta> tags from
<body> as well as <head>.
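As a rough illustration only (this is not the actual patch), a traversal that
does not bail out at <body> might look like the sketch below. The class
MetaTagWalker and its methods are hypothetical names, and the sketch assumes
well-formed XHTML parsed with the JDK's built-in XML parser rather than the
Nutch/Tika parsing pipeline.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: collects <meta name=... content=...>
// pairs from anywhere in the DOM, including under <body>.
public class MetaTagWalker {

  /** Returns name/content pairs for every named <meta> element below root. */
  public static Map<String, String> collectMetaTags(Node root) {
    Map<String, String> found = new LinkedHashMap<>();
    walk(root, found);
    return found;
  }

  private static void walk(Node node, Map<String, String> found) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "meta".equalsIgnoreCase(node.getNodeName())) {
      Element meta = (Element) node;
      String name = meta.getAttribute("name"); // "" when the attribute is absent
      if (!name.isEmpty()) {
        found.put(name.toLowerCase(Locale.ROOT), meta.getAttribute("content"));
      }
    }
    // Unlike the snippet quoted above, there is no early return on <body>:
    // every child is visited regardless of where it sits in the document.
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), found);
    }
  }

  /** Parses a well-formed XHTML string into a DOM (assumption: no tag soup). */
  public static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    String page =
        "<html><head><meta name=\"robots\" content=\"noindex\"/></head>"
        + "<body><p>text</p>"
        + "<meta name=\"description\" content=\"in body\"/></body></html>";
    // Both the <head> and the <body> meta tags end up in the map.
    System.out.println(collectMetaTags(parse(page)));
  }
}
```

Whether body-level tags should be able to override head-level ones (for
example a robots directive) is a policy question the sketch does not settle;
here a later tag with the same name simply replaces an earlier one.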
(0)
https://github.com/apache/nutch/blob/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java#L56