ASF GitHub Bot commented on NUTCH-2683:

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: 
add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
> The deduplication job can keep the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently, search 
> engines have started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter are 
> shorter by one character. Of course, this should be configurable and apply in 
> addition to the existing preferences (length, score and fetch time) used to 
> select the "best" URL among duplicates.
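
The requested preference could be sketched roughly as follows. This is only an illustration of the rule described above, not the DeduplicationJob code: the class name, the method names and the `preferHttps` flag are hypothetical, the score and fetch-time criteria are left out, and URLs are assumed to be already normalized (lowercase scheme, no explicit default ports).

```java
// Hypothetical sketch: none of these names come from the Nutch code base.
class HttpsPreferenceSketch {

    /** Returns the URL without its scheme, or null if it is neither http nor https. */
    static String stripScheme(String url) {
        if (url.startsWith("https://")) return url.substring("https://".length());
        if (url.startsWith("http://")) return url.substring("http://".length());
        return null;
    }

    /** True if the two URLs are identical except for http:// vs. https://. */
    static boolean sameExceptProtocol(String a, String b) {
        String restA = stripScheme(a);
        String restB = stripScheme(b);
        return restA != null && restA.equals(restB) && !a.equals(b);
    }

    /**
     * Picks the "best" URL of two duplicates. If preferHttps is set and the
     * URLs differ only in the protocol, the https:// URL wins even though it
     * is one character longer. Otherwise the shorter URL wins (ties keep a).
     */
    static String pickBest(String a, String b, boolean preferHttps) {
        if (preferHttps && sameExceptProtocol(a, b)) {
            return a.startsWith("https://") ? a : b;
        }
        return b.length() < a.length() ? b : a;
    }
}
```

With the flag off, the sketch falls back to the plain length comparison, so existing behaviour would be unchanged unless the option is enabled.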

This message was sent by Atlassian JIRA
