Sebastian Nagel created NUTCH-2683:
--------------------------------------
Summary: DeduplicationJob: add option to prefer https:// over http://
Key: NUTCH-2683
URL: https://issues.apache.org/jira/browse/NUTCH-2683
Project: Nutch
Issue Type: Improvement
Components: crawldb
Affects Versions: 1.15
Reporter: Sebastian Nagel
Fix For: 1.16
The deduplication job can keep the shortest URL as the "best" URL of a
set of duplicates, marking all longer ones as duplicates. Recently, search
engines have started to penalize non-https pages by [giving https pages a higher
rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
and [marking http as
insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
If URLs are identical except for the protocol, the deduplication job should be
able to prefer https:// over http:// URLs, although the latter are shorter
by one character. Of course, this should be configurable and apply in addition
to the existing preferences (length, score and fetch time) used to select the
"best" URL among duplicates.
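
A minimal sketch of the requested preference, for illustration only. It is not
the DeduplicationJob code: the class HttpsPreferenceSketch, the methods
stripScheme and pickBest, and the preferHttps flag are all hypothetical names.
It assumes URLs are already normalized (lowercase scheme) and falls back to URL
length only, leaving out the score and fetch time criteria.

```java
public class HttpsPreferenceSketch {

  /**
   * Hypothetical helper: returns the URL without its scheme prefix,
   * or null if the URL is neither http:// nor https://.
   * Assumes the scheme is already lowercase (normalized URLs).
   */
  static String stripScheme(String url) {
    if (url.startsWith("https://")) {
      return url.substring("https://".length());
    }
    if (url.startsWith("http://")) {
      return url.substring("http://".length());
    }
    return null;
  }

  /**
   * Picks which of two duplicate URLs to keep.
   * If preferHttps is set and the URLs differ only in protocol,
   * the https:// variant wins even though it is one character longer.
   * Otherwise the shorter URL wins (ties keep the first argument).
   */
  static String pickBest(String a, String b, boolean preferHttps) {
    if (preferHttps && !a.equals(b)) {
      String restA = stripScheme(a);
      String restB = stripScheme(b);
      if (restA != null && restA.equals(restB)) {
        return a.startsWith("https://") ? a : b;
      }
    }
    // Fallback: the existing "shortest URL" preference.
    return a.length() <= b.length() ? a : b;
  }

  public static void main(String[] args) {
    String http = "http://example.com/";
    String https = "https://example.com/";
    System.out.println(pickBest(http, https, true));   // https://example.com/
    System.out.println(pickBest(http, https, false));  // http://example.com/
  }
}
```

With the flag disabled the sketch behaves like the length-only rule, so the
http:// URL is kept. URLs that differ in more than the protocol are never
affected by the flag.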
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)