[ https://issues.apache.org/jira/browse/NUTCH-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerard Bouchar updated NUTCH-2634:
----------------------------------
    Description:
In order to check if an outlink in an <a> tag can be followed, nutch checks whether the value of its rel attribute is the exact string "nofollow". However, [the rel attribute can contain a list of link types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], all of which should be respected.

So nutch rightfully doesn't follow a link like:
{code:html}
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
{code}
but wrongfully follows:
{code:html}
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
{code}
Because of the code duplication in nutch's html parsers, this should be fixed in two places:
# [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
# [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]

  was:
In order to check if an outlink in an <a> tag can be followed, nutch checks whether the value of its rel attribute is the exact string "nofollow". However, the rel attribute can contain a list of link types, all of which should be respected.
So nutch rightfully doesn't follow a link like:
{code:html}
<a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
{code}
but wrongfully follows:
{code:html}
<a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
{code}
Because of the code duplication in nutch's html parsers, this should be fixed in two places:
# [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
# [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]

> Some links marked as "nofollow" are followed anyway.
> ----------------------------------------------------
>
>                 Key: NUTCH-2634
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2634
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> In order to check if an outlink in an <a> tag can be followed, nutch checks whether the value of its rel attribute is the exact string "nofollow". However, [the rel attribute can contain a list of link types|https://html.spec.whatwg.org/multipage/links.html#attr-hyperlink-rel], all of which should be respected.
> So nutch rightfully doesn't follow a link like:
> {code:html}
> <a href='top-secret.html' rel="nofollow">DO NOT FOLLOW THIS LINK</a>
> {code}
> but wrongfully follows:
> {code:html}
> <a href='top-secret.html' rel="nofollow noreferrer">DO NOT FOLLOW THIS LINK</a>
> {code}
> Because of the code duplication in nutch's html parsers, this should be fixed in two places:
> # [parse/html/DOMContentUtils.java|https://github.com/apache/nutch/blob/3ada351a26b653b307c19e25b17e0e611a9bd59a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L437]
> # [parse/tika/DOMContentUtils.java|https://github.com/apache/nutch/blob/f02110f42c53e77450835776cf41f22c23f030ec/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java#L410]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
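
The behavior the description asks for, treating the rel value as a list of space-separated link types instead of one exact string, could be sketched roughly as follows. This is only an illustration: the class name RelNoFollowCheck and the method hasNoFollow are hypothetical and are not taken from either DOMContentUtils.java, and the case-insensitive comparison is an assumption based on how HTML treats link type keywords.

```java
// Hypothetical helper, not part of Nutch: decides whether a rel attribute
// value forbids following the link.
final class RelNoFollowCheck {

  private RelNoFollowCheck() {
  }

  /**
   * Returns true if any whitespace-separated token of the rel value is
   * "nofollow" (compared case-insensitively, which is an assumption here).
   */
  static boolean hasNoFollow(String rel) {
    if (rel == null) {
      return false; // no rel attribute: nothing forbids following
    }
    // Split on runs of ASCII whitespace (space, tab, LF, FF, CR).
    for (String token : rel.trim().split("[ \\t\\n\\f\\r]+")) {
      if ("nofollow".equalsIgnoreCase(token)) {
        return true;
      }
    }
    return false;
  }
}
```

With this check, both rel="nofollow" and rel="nofollow noreferrer" from the examples above would be treated as not followable, while a value such as rel="noreferrer" alone would not.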