[ https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814387#comment-16814387 ]
Hudson commented on NUTCH-2683:
-------------------------------
FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See [https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2683 DeduplicationJob: add option to prefer https:// over http:// (snagel: [https://github.com/apache/nutch/commit/3958d0c23e32855225fd52403da7c7234eef5ea2])
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java
> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a
> set of duplicates, marking all longer ones as duplicates. Recently, search
> engines started to penalize non-https pages by [giving https pages a higher
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html],
> and Chrome began [marking http as
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If two URLs are identical except for the protocol, the deduplication job
> should be able to prefer the https:// URL over the http:// one, although the
> latter is shorter by one character. Of course, this should be configurable
> and should apply in addition to the existing preferences (length, score and
> fetch time) used to select the "best" URL among duplicates.
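
The preference described above could be sketched roughly as follows. This is only an illustration, not the code committed to DeduplicationJob.java: the class and method names (HttpsPreferenceSketch, stripHttpScheme, differOnlyByScheme, prefer, best) are hypothetical, URLs are assumed to be already normalized to lower-case schemes, and the score and fetch time criteria are left out so that only the interaction between the https rule and the length rule is shown.

```java
import java.util.List;

public class HttpsPreferenceSketch {

    /** Returns the part after "http://" or "https://", or null for any other scheme. */
    static String stripHttpScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /** True if both URLs are http(s) URLs that are identical apart from the scheme. */
    static boolean differOnlyByScheme(String a, String b) {
        String restA = stripHttpScheme(a);
        return restA != null && restA.equals(stripHttpScheme(b)) && !a.equals(b);
    }

    /** Picks the preferred URL of a pair: https rule first, shorter URL as fallback. */
    static String prefer(String a, String b) {
        if (differOnlyByScheme(a, b)) {
            // Same host, path and query: keep the https variant even though it is longer.
            return a.startsWith("https://") ? a : b;
        }
        // Fallback (assumed here): the shorter URL wins, ties keep the first one.
        return b.length() < a.length() ? b : a;
    }

    /** Reduces a non-empty list of duplicate URLs to the preferred one, pair by pair. */
    static String best(List<String> duplicates) {
        String best = duplicates.get(0);
        for (String url : duplicates.subList(1, duplicates.size())) {
            best = prefer(best, url);
        }
        return best;
    }

    public static void main(String[] args) {
        // Differ only by protocol: prints https://example.org/
        System.out.println(best(List.of("http://example.org/", "https://example.org/")));
        // Differ in path as well: length decides, prints http://example.org/
        System.out.println(best(List.of("https://example.org/index.html", "http://example.org/")));
    }
}
```

The selection is done pairwise rather than by sorting, because a rule that only applies to some pairs of URLs does not necessarily define a consistent total order.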