[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785836#comment-16785836 ]
Sebastian Nagel commented on NUTCH-2683:
----------------------------------------
Any comments or objections? Thanks! Otherwise I'll commit.
> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a set of
> duplicates, marking all longer ones as duplicates. Recently, search engines
> started to penalize non-https pages by [giving https pages a higher
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
> and [marking http as
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be
> able to prefer https:// over http:// URLs, even though the latter are shorter
> by one character. Of course, this should be configurable and should apply in
> addition to the existing preferences (length, score and fetch time) used to
> select the "best" URL among duplicates.
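
For illustration only, the preference described above could be sketched roughly as follows. This is not the code of DeduplicationJob or of the attached patch: the class HttpsPreferenceSketch, its methods and the preferHttps flag are hypothetical names, and only URL length is used as the fallback criterion (score and fetch time are left out to keep the sketch short).

```java
/** Hypothetical sketch, not Nutch code: prefer https:// among protocol-only duplicates. */
class HttpsPreferenceSketch {

  /** Returns the URL without its scheme prefix, or null if it is neither http nor https. */
  static String stripScheme(String url) {
    if (url.startsWith("https://")) return url.substring("https://".length());
    if (url.startsWith("http://")) return url.substring("http://".length());
    return null;
  }

  /**
   * Negative if a is preferred, positive if b is preferred, 0 if this
   * criterion does not apply (URLs differ in more than the protocol,
   * or both use the same protocol).
   */
  static int compareHttpsOverHttp(String a, String b) {
    String restA = stripScheme(a);
    String restB = stripScheme(b);
    if (restA == null || !restA.equals(restB)) return 0;
    boolean aHttps = a.startsWith("https://");
    boolean bHttps = b.startsWith("https://");
    return Boolean.compare(bHttps, aHttps);
  }

  /** Picks the "best" URL: optional https preference first, then the shorter URL. */
  static String pickBest(String a, String b, boolean preferHttps) {
    if (preferHttps) {
      int c = compareHttpsOverHttp(a, b);
      if (c != 0) return c < 0 ? a : b;
    }
    // Assumed fallback: keep the shorter URL, as in the existing length criterion.
    return a.length() <= b.length() ? a : b;
  }
}
```

With preferHttps disabled the sketch falls back to length alone, so http://example.org/ would still win over https://example.org/ by one character, which is the behaviour the issue wants to make avoidable.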