[ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814387#comment-16814387
 ] 

Hudson commented on NUTCH-2683:
-------------------------------

FAILURE: Integrated in Jenkins build Nutch-trunk #3619 (See 
[https://builds.apache.org/job/Nutch-trunk/3619/])
NUTCH-2683 DeduplicationJob: add option to prefer https:// over http:// 
(snagel: 
[https://github.com/apache/nutch/commit/3958d0c23e32855225fd52403da7c7234eef5ea2])
* (edit) src/java/org/apache/nutch/crawl/DeduplicationJob.java


> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job allows keeping the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently, search 
> engines have started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be 
> able to prefer https:// URLs over http:// URLs, although the latter are 
> shorter by one character. Of course, this should be configurable and apply in 
> addition to the existing preferences (length, score and fetch time) used to 
> select the "best" URL among duplicates.
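
The preference described above can be illustrated with a minimal Java sketch. This is not the code from the commit to DeduplicationJob.java; the class name HttpsPreferenceSketch, the methods stripScheme, sameExceptScheme and pickBest, and the preferHttps flag are all hypothetical names chosen for illustration. It assumes only two criteria: an optional https-over-http preference for URLs that are otherwise identical, followed by the existing shortest-URL rule (score and fetch time are left out).

```java
public class HttpsPreferenceSketch {

    /** Returns the URL without its http:// or https:// prefix, or null for other schemes. */
    static String stripScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /** True if both URLs are http(s) URLs that are identical once the scheme is removed. */
    static boolean sameExceptScheme(String a, String b) {
        String restA = stripScheme(a);
        return restA != null && restA.equals(stripScheme(b));
    }

    /**
     * Picks the "best" of two duplicate URLs (hypothetical helper).
     * If preferHttps is set and the URLs differ only in the protocol,
     * the https:// variant wins; otherwise the shorter URL wins,
     * and the first argument is kept on a tie.
     */
    static String pickBest(String a, String b, boolean preferHttps) {
        if (preferHttps && !a.equals(b) && sameExceptScheme(a, b)) {
            return a.startsWith("https://") ? a : b;
        }
        return b.length() < a.length() ? b : a;
    }

    public static void main(String[] args) {
        String http = "http://example.com/";
        String https = "https://example.com/";
        System.out.println(pickBest(http, https, true));
        System.out.println(pickBest(http, https, false));
    }
}
```

With the flag enabled, the https:// variant is kept even though it is one character longer; with the flag disabled, the sketch falls back to the length-only rule, which is why the option needs to be configurable.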



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
