[
https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655842#comment-17655842
]
Hudson commented on NUTCH-2634:
-------------------------------
FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #91 (See
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/91/])
NUTCH-2634 Some links marked as "nofollow" are followed anyway (snagel:
[https://github.com/apache/nutch/commit/dfdd00f3189839b6ed7d60651e5daa33f0038265])
* (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestDOMContentUtils.java
* (edit) src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestDOMContentUtils.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
> Some links marked as "nofollow" are followed anyway.
> ----------------------------------------------------
>
> Key: NUTCH-2634
> URL: https://issues.apache.org/jira/browse/NUTCH-2634
> Project: Nutch
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.20
>
>
> In order to check whether an outlink in an <a> tag can be followed, Nutch checks
> whether the value of its rel attribute is the exact string "nofollow".
> However, [the rel attribute can contain a list of link
> types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel],
> all of which should be respected.
> So Nutch correctly does not follow a link like:
> {code:html}
> <a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
> {code}
> but wrongly follows:
> {code:html}
> <a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
> {code}
> Because of the code duplication in Nutch's HTML parsers, this should be fixed
> in two places:
> # [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]
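
For illustration only, here is a minimal sketch of the token-based check the
issue description asks for. It is not the code from the commit referenced
above; the class name RelAttributeSketch and the method hasNoFollow are
hypothetical. It assumes the rel value is a whitespace-separated list of
tokens and that the "nofollow" keyword is matched case-insensitively.

```java
final class RelAttributeSketch {

  /**
   * Returns true if any whitespace-separated token of the rel value
   * equals "nofollow" (ignoring case). Hypothetical helper, not Nutch API.
   */
  static boolean hasNoFollow(String rel) {
    if (rel == null) {
      return false; // no rel attribute: nothing forbids following the link
    }
    // Split the attribute value into its individual link-type tokens.
    for (String token : rel.trim().split("\\s+")) {
      if ("nofollow".equalsIgnoreCase(token)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasNoFollow("nofollow"));            // true
    System.out.println(hasNoFollow("nofollow noreferrer")); // true
    System.out.println(hasNoFollow("noreferrer"));          // false
  }
}
```

With an exact string comparison, the second call would return false, which is
the behaviour reported in this issue.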