[
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735682#comment-16735682
]
ASF GitHub Bot commented on NUTCH-2683:
---------------------------------------
sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob:
add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425
- add optional value "httpsOverHttp" to the -compareOrder argument to prefer
https:// over http://; it takes effect only if it is listed before
"urlLength" and neither "score" nor "fetchTime" takes precedence
- code improvements: remove nested loop, sort imports, add `@Override`
annotations where applicable
Testing with one pair of https/http duplicates:
```
% cat seeds.txt
http://nutch.apache.org/
https://nutch.apache.org/
% nutch inject crawldb seeds.txt
...
% nutch generate crawldb/ segments
...
% nutch fetch segments/*
...
% nutch parse segments/*
...
% nutch updatedb crawldb/ segments/*
...
% nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
...
Deduplication: 1 documents marked as duplicates
...
% nutch readdb crawldb/ -url https://nutch.apache.org/
URL: https://nutch.apache.org/
Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Feb 06 11:55:33 CET 2019
Modified time: Mon Jan 07 11:55:33 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1800001
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
% nutch readdb crawldb/ -url http://nutch.apache.org/
URL: http://nutch.apache.org/
Version: 7
Status: 7 (db_duplicate)
Fetch time: Wed Feb 06 11:55:39 CET 2019
Modified time: Mon Jan 07 11:55:39 CET 2019
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.1800001
Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
Metadata:
...
```
The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp"
is configured.
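To illustrate why the position of "httpsOverHttp" in -compareOrder matters, here is a minimal sketch of how an ordered list of criteria might pick the document to keep. It is not the DeduplicationJob code: the class `DedupOrderSketch`, the record `Doc`, and the methods `compare`, `compareHttps` and `keep` are hypothetical names, and the rules assumed here are that the higher score, the more recent fetch time, and the shorter URL win.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names and structure are assumptions, not Nutch code. */
class DedupOrderSketch {

  /** Hypothetical minimal stand-in for one CrawlDb entry. */
  static final class Doc {
    final String url;
    final float score;
    final long fetchTime;

    Doc(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  /** Negative: keep a. Positive: keep b. Zero: this criterion cannot decide. */
  static int compare(String criterion, Doc a, Doc b) {
    switch (criterion) {
      case "score":
        return Float.compare(b.score, a.score); // assumed: higher score wins
      case "fetchTime":
        return Long.compare(b.fetchTime, a.fetchTime); // assumed: more recent wins
      case "urlLength":
        return Integer.compare(a.url.length(), b.url.length()); // shorter URL wins
      case "httpsOverHttp":
        return compareHttps(a.url, b.url);
      default:
        return 0; // unknown criteria are ignored in this sketch
    }
  }

  /** Decides only if the URLs are identical except for the protocol. */
  static int compareHttps(String a, String b) {
    final String https = "https://";
    final String http = "http://";
    if (a.startsWith(https) && b.startsWith(http)
        && a.substring(https.length()).equals(b.substring(http.length()))) {
      return -1;
    }
    if (b.startsWith(https) && a.startsWith(http)
        && b.substring(https.length()).equals(a.substring(http.length()))) {
      return 1;
    }
    return 0;
  }

  /** Applies the criteria in order; the first one that decides wins. */
  static Doc keep(Doc a, Doc b, List<String> order) {
    for (String criterion : order) {
      int c = compare(criterion, a, b);
      if (c != 0) {
        return c < 0 ? a : b;
      }
    }
    return a; // nothing decided: keep the first document
  }

  public static void main(String[] args) {
    // Same score; the http page was fetched 6 seconds later, as in the test above.
    Doc secure = new Doc("https://nutch.apache.org/", 1.18f, 0L);
    Doc plain = new Doc("http://nutch.apache.org/", 1.18f, 6000L);

    List<String> withHttps = Arrays.asList("httpsOverHttp", "score", "urlLength", "fetchTime");
    List<String> withoutHttps = Arrays.asList("score", "urlLength", "fetchTime");

    System.out.println(keep(secure, plain, withHttps).url);    // https://nutch.apache.org/
    System.out.println(keep(secure, plain, withoutHttps).url); // http://nutch.apache.org/
  }
}
```

In this sketch, placing "httpsOverHttp" after "urlLength" would have no effect for such a pair, because the http:// URL is always one character shorter and "urlLength" would already have decided.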
----------------------------------------------------------------
> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
> Key: NUTCH-2683
> URL: https://issues.apache.org/jira/browse/NUTCH-2683
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a
> set of duplicates, marking all longer ones as duplicates. Search engines
> have recently started to penalize non-https pages by [giving https pages a higher
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html]
> and [marking http as
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be
> able to prefer https:// over http:// URLs, although the latter are
> shorter by one character. Of course, this should be configurable and work in
> addition to the existing preferences (length, score and fetch time) for
> selecting the "best" URL among duplicates.