Felix Zett created NUTCH-2720:
---------------------------------
Summary: ROBOTS metatag ignored when capitalized
Key: NUTCH-2720
URL: https://issues.apache.org/jira/browse/NUTCH-2720
Project: Nutch
Issue Type: Bug
Components: indexer, robots
Affects Versions: 1.15
Reporter: Felix Zett
Attachments: noindex.html
As discussed [on the mailing
list|https://www.mail-archive.com/[email protected]/msg16516.html],
index-metadata fails to ignore a webpage with a capitalized robots metatag such
as {{<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">}}. This only applies when
parse-tika is used; parse-html will "decapitalize" the metatag.
Parsing the attached [^noindex.html] leads to the following results:
*parse-html:*
{code:java}
bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" \
  -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" \
  -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html

Parse Metadata: [...] metatag.robots=noindex,nofollow
robots=noindex,nofollow{code}
*parse-tika:*
{code:java}
bin/nutch parsechecker \
  -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" \
  -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" \
  -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html

Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW [...] ROBOTS=NOINDEX,NOFOLLOW
[...]{code}
Because the field is named "ROBOTS" and not "robots",
{{parseData.getMeta("robots")}} returns {{null}} in
[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257],
so the noindex directive is never seen there.
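The failure mode can be illustrated with a minimal, self-contained sketch. It assumes the parse metadata behaves like a case-sensitive string map; a plain {{java.util.Map}} stands in for Nutch's metadata here, and {{RobotsMetaLookup}}, {{getIgnoreCase}} and {{isNoIndex}} are hypothetical names for illustration, not Nutch APIs or a proposed patch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: shows why a case-sensitive
// lookup misses "ROBOTS" and how a case-insensitive lookup would find it.
final class RobotsMetaLookup {

    /** Returns the first value whose key equals name ignoring case, or null. */
    static String getIgnoreCase(Map<String, String> meta, String name) {
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (e.getKey().equalsIgnoreCase(name)) {
                return e.getValue();
            }
        }
        return null;
    }

    /** True if a robots entry (any capitalization) contains "noindex". */
    static boolean isNoIndex(Map<String, String> meta) {
        String robots = getIgnoreCase(meta, "robots");
        return robots != null
            && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Stand-in for the parse metadata reported above for parse-tika.
        Map<String, String> tikaMeta = new LinkedHashMap<>();
        tikaMeta.put("ROBOTS", "NOINDEX,NOFOLLOW");

        System.out.println(tikaMeta.get("robots")); // prints "null"
        System.out.println(isNoIndex(tikaMeta));    // prints "true"
    }
}
```

Whether the better fix is to normalize the key at parse time in parse-tika or to look it up case-insensitively in the indexer is left open here; the sketch only demonstrates the mismatch.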