Sebastian Nagel created NUTCH-2683:
--------------------------------------

             Summary: DeduplicationJob: add option to prefer https:// over http://
                 Key: NUTCH-2683
                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.16


The deduplication job can keep the shortest URL as the "best" URL of a 
set of duplicates, marking all longer ones as duplicates. Recently, search 
engines and browsers have started to penalize non-https pages by [giving https pages a higher 
rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
and [marking http as 
insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].

If URLs are identical except for the protocol, the deduplication job should be 
able to prefer https:// over http:// URLs, although the latter are shorter 
by one character. Of course, this should be configurable and apply in addition to 
the existing preferences (length, score and fetch time) used to select the "best" URL 
among duplicates.
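
To illustrate the proposed behaviour, here is a minimal sketch in plain Java. It is not the DeduplicationJob code: the class name HttpsPreference, the method names and the preferHttps flag are all hypothetical, and only URL length stands in for the existing preferences. It also assumes URLs are already normalized, so the scheme is lowercase.

```java
public class HttpsPreference {

    /** Strips a leading "http://" or "https://"; returns null for any other scheme. */
    static String stripHttpScheme(String url) {
        if (url.startsWith("https://")) {
            return url.substring("https://".length());
        }
        if (url.startsWith("http://")) {
            return url.substring("http://".length());
        }
        return null;
    }

    /**
     * Returns a negative value if a should be kept over b, a positive value if
     * b should be kept over a, and 0 if this criterion does not decide, that is
     * if the URLs differ in more than the protocol or use the same protocol.
     */
    static int compareHttpsOverHttp(String a, String b) {
        String restA = stripHttpScheme(a);
        String restB = stripHttpScheme(b);
        if (restA == null || restB == null || !restA.equals(restB)) {
            return 0;
        }
        boolean aHttps = a.startsWith("https://");
        boolean bHttps = b.startsWith("https://");
        if (aHttps == bHttps) {
            return 0;
        }
        return aHttps ? -1 : 1;
    }

    /**
     * Picks the URL to keep among two duplicates. The hypothetical preferHttps
     * flag makes the protocol check optional; if it is off or undecided, the
     * shorter URL wins (ties go to the first argument).
     */
    static String pickBest(String a, String b, boolean preferHttps) {
        if (preferHttps) {
            int c = compareHttpsOverHttp(a, b);
            if (c != 0) {
                return c < 0 ? a : b;
            }
        }
        return a.length() <= b.length() ? a : b;
    }
}
```

With the flag enabled, pickBest("http://example.com/", "https://example.com/", true) keeps the https:// URL although it is one character longer. With the flag disabled, the shorter http:// URL is kept as before.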



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
