[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964202#comment-13964202
 ] 

Sebastian Nagel commented on NUTCH-710:
---------------------------------------

Thanks, [~Sertac Turkel]! My comments:
* every page containing a canonical link is now rejected. That's a rather harsh 
decision. It should be configurable whether pages containing valid 
(non-empty, not self-referential, etc.) canonical links
*# are unconditionally rejected
*# are removed later only if the target is indexed. It's close to 
deduplication, and it's what canonical links are intended for: giving webmasters 
a chance to support and influence deduplication.
*# are only recorded (as outlinks and/or as indexed fields)
This point is the most challenging one: you need to take care of all the nasty 
situations "in the wild", e.g. a canonical link pointing to a redirect which 
leads you back to the current page, etc. Chains of canonical links have to be 
"resolved" in combination with redirects, see Julien's comment and 
[1|http://mail-archives.apache.org/mod_mbox/nutch-user/201203.mbox/%3CCA+-fM0sg=rvuNxzoez5NLFmhNJHta=qp5qhtfrj8ii55fb2...@mail.gmail.com%3E].
* is it really necessary to handle canonical links explicitly in 
DbUpdateMapper and mark them as injected? Couldn't this be done by simply adding 
them as outlinks? By default, links of "link" elements are added as outlinks, 
cf. parser.html.outlinks.ignore_tags. Of course, canonical links should be 
added even if "link" elements are ignored.
* extraction of canonical links: at least the following points are missing: 
relative URLs, and canonical links inside HTTP headers (required for anything 
which is not HTML). I'll try to support you on this point because some work 
has already been done.
* keep the test class names parallel in both plugins?
{code}src/plugin/parse-html/.../TestDOMContentUtils.java
src/plugin/parse-tika/.../DOMContentUtilsTest.java
{code}
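
To illustrate what "resolving" a chain of canonical links and redirects could look like, here is a minimal sketch. It is not part of the patch and uses no Nutch API: the class name, the two maps (stand-ins for whatever lookup the CrawlDb would provide) and the hop limit are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper, not Nutch code: follows canonical links and redirects. */
final class CanonicalChainResolver {

  /** Upper bound on hops; an arbitrary value chosen for this sketch. */
  static final int MAX_HOPS = 10;

  /**
   * Returns the final target reachable from {@code start}, or empty if the
   * chain loops, exceeds MAX_HOPS, or never leaves the start URL.
   * Both maps go from a URL to the URL it points to.
   */
  static Optional<String> resolve(String start,
      Map<String, String> canonical, Map<String, String> redirect) {
    Set<String> seen = new HashSet<>();
    seen.add(start);
    String current = start;
    for (int hop = 0; hop < MAX_HOPS; hop++) {
      // A redirect is followed first; otherwise use the page's canonical link.
      String next = redirect.get(current);
      if (next == null) {
        next = canonical.get(current);
      }
      // No link, an empty link or a self-reference ends the chain here.
      if (next == null || next.isEmpty() || next.equals(current)) {
        return current.equals(start) ? Optional.empty() : Optional.of(current);
      }
      // A URL seen before means the chain leads back: treat as unresolvable.
      if (!seen.add(next)) {
        return Optional.empty();
      }
      current = next;
    }
    return Optional.empty(); // chain too long
  }

  public static void main(String[] args) {
    Map<String, String> canonical = new HashMap<>();
    Map<String, String> redirect = new HashMap<>();
    canonical.put("http://example.com/a?sid=1", "http://example.com/b");
    redirect.put("http://example.com/b", "http://example.com/a");
    System.out.println(resolve("http://example.com/a?sid=1", canonical, redirect));
  }
}
```

A page would then be removed in favour of the resolved target only if that target is present and indexed; an empty result means the canonical link is better just recorded.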

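Regarding the extraction point above, the following sketch shows one possible way to pick a canonical link from an HTTP "Link" header (cf. RFC 6596) and to resolve relative URLs against the page URL. The class and method names are hypothetical, the header parsing is deliberately simplified (e.g. no commas inside quoted parameters), and only java.net.URI from the JDK is used, not the Nutch URL utilities.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not Nutch code: canonical link from a Link header. */
final class CanonicalLinkExtractor {

  // One link-value: <target> followed by its parameters (simplified grammar).
  private static final Pattern LINK_VALUE = Pattern.compile("<([^>]*)>([^,]*)");
  // The rel parameter, quoted or unquoted.
  private static final Pattern REL = Pattern.compile(
      ";\\s*rel\\s*=\\s*\"?([^\";]*)\"?", Pattern.CASE_INSENSITIVE);

  /** Returns the absolute canonical URL named in the header value, if any. */
  static Optional<String> fromLinkHeader(String headerValue, String pageUrl) {
    if (headerValue == null) {
      return Optional.empty();
    }
    Matcher link = LINK_VALUE.matcher(headerValue);
    while (link.find()) {
      Matcher rel = REL.matcher(link.group(2));
      if (rel.find() && hasCanonical(rel.group(1))) {
        return resolve(pageUrl, link.group(1));
      }
    }
    return Optional.empty();
  }

  /** rel may hold several space-separated relation types. */
  private static boolean hasCanonical(String relValue) {
    for (String token : relValue.trim().split("\\s+")) {
      if (token.equalsIgnoreCase("canonical")) {
        return true;
      }
    }
    return false;
  }

  /**
   * Resolves a possibly relative href against the page URL. Empty,
   * malformed and self-referential links yield an empty result.
   */
  static Optional<String> resolve(String pageUrl, String href) {
    if (href == null || href.trim().isEmpty()) {
      return Optional.empty();
    }
    try {
      String target = URI.create(pageUrl).resolve(href.trim()).toString();
      return target.equals(pageUrl) ? Optional.empty() : Optional.of(target);
    } catch (IllegalArgumentException e) {
      return Optional.empty(); // href or page URL is not a valid URI
    }
  }

  public static void main(String[] args) {
    System.out.println(fromLinkHeader(
        "</doc.pdf>; rel=\"canonical\"", "http://example.com/dl/doc.pdf?sid=1"));
  }
}
```

The same resolve() step would apply to the href of a "link" element found in HTML, so both sources could share the validity checks.
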
... and some useful references:
[http://en.wikipedia.org/wiki/Canonical_link_element]
[http://tools.ietf.org/html/rfc6596]
[https://support.google.com/webmasters/answer/139066]
[http://www.mattcutts.com/blog/rel-canonical-html-head/]
[http://googlewebmastercentral.blogspot.de/2011/06/supporting-relcanonical-http-headers.html]


> Support for rel="canonical" attribute
> -------------------------------------
>
>                 Key: NUTCH-710
>                 URL: https://issues.apache.org/jira/browse/NUTCH-710
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.1
>            Reporter: Frank McCown
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-710.patch, canonical.patch
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
