Gerard Bouchar created NUTCH-2634:
-------------------------------------
Summary: Some links marked as "nofollow" are followed anyway.
Key: NUTCH-2634
URL: https://issues.apache.org/jira/browse/NUTCH-2634
Project: Nutch
Issue Type: Bug
Reporter: Gerard Bouchar
In order to check if an outlink in an <a> tag can be followed, Nutch checks
whether the value of its rel attribute is exactly the string "nofollow".
However, the rel attribute can contain a list of link types, all of which
should be respected.
So nutch rightfully doesn't follow a link like:
{code:html}
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
{code}
but wrongfully follows:
{code:html}
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
{code}
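A check that treats rel as a whitespace-separated list of link types would handle both cases. The following is only a minimal sketch, not the actual Nutch code: the class name RelTokens and the method hasNoFollow are hypothetical, and it assumes that link types should be compared case-insensitively and that Java's \s whitespace class is close enough to HTML's notion of whitespace.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: token-based test for rel="nofollow". */
final class RelTokens {

    private RelTokens() {
    }

    /** Returns true if any whitespace-separated token of rel equals "nofollow". */
    static boolean hasNoFollow(String rel) {
        if (rel == null) {
            return false; // no rel attribute at all
        }
        // Split the attribute value into its individual link types.
        for (String token : rel.trim().split("\\s+")) {
            // Assumption: link types are matched case-insensitively.
            if ("nofollow".equals(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasNoFollow("nofollow"));            // true
        System.out.println(hasNoFollow("nofollow noreferrer")); // true
        System.out.println(hasNoFollow("noreferrer"));          // false
    }
}
```

With a check of this kind, both example links above would be skipped, while a link whose rel contains only other link types would still be followed.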
Because of the code duplication in Nutch's HTML parsers, this should be fixed
in two places:
# [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
# [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)