[jira] [Commented] (NUTCH-2683) DeduplicationJob: add option to prefer https:// over http://

ASF GitHub Bot (JIRA) Mon, 07 Jan 2019 03:15:04 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735682#comment-16735682
 ]


ASF GitHub Bot commented on NUTCH-2683:
---------------------------------------

sebastian-nagel commented on pull request #425: NUTCH-2683 DeduplicationJob: 
add option to prefer https:// over http://
URL: https://github.com/apache/nutch/pull/425
 
 
   - add optional value "httpsOverHttp" to the -compareOrder argument to prefer 
https:// over http://; it takes effect if it is listed before "urlLength" and 
neither "score" nor "fetchTime" takes precedence
   - code improvements: remove nested loop, sort imports, add `@Override` 
annotations where applicable
   
   Testing with one pair of https/http duplicates:
   ```
   % cat seeds.txt 
   http://nutch.apache.org/
   https://nutch.apache.org/
   
   % nutch inject crawldb seeds.txt
   ...
   
   % nutch generate crawldb/ segments
   ...
   
   % nutch fetch segments/*
   ...
   
   % nutch parse segments/*
   ...
   
   % nutch updatedb crawldb/ segments/*
   ...
   
   % nutch dedup crawldb -compareOrder httpsOverHttp,score,urlLength,fetchTime
   ...
   Deduplication: 1 documents marked as duplicates
   ...
   
   % nutch readdb crawldb/ -url https://nutch.apache.org/
   URL: https://nutch.apache.org/
   Version: 7
   Status: 2 (db_fetched)
   Fetch time: Wed Feb 06 11:55:33 CET 2019
   Modified time: Mon Jan 07 11:55:33 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.1800001
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   
   % nutch readdb crawldb/ -url http://nutch.apache.org/
   URL: http://nutch.apache.org/
   Version: 7
   Status: 7 (db_duplicate)
   Fetch time: Wed Feb 06 11:55:39 CET 2019
   Modified time: Mon Jan 07 11:55:39 CET 2019
   Retries since fetch: 0
   Retry interval: 2592000 seconds (30 days)
   Score: 1.1800001
   Signature: da0ffbf19768ea2cab9ffa0fb4a778a7
   Metadata: 
   ...
   ```
   The URL `https://nutch.apache.org/` is kept as expected if "httpsOverHttp" 
is configured.
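
To illustrate how the order of criteria in -compareOrder decides which duplicate is kept, here is a minimal, self-contained sketch. It is not the code from the pull request: the class `DedupOrderSketch`, the `Entry` type and the helper methods are hypothetical names made up for this example, and the fetch times are placeholder values. It assumes that a higher score, a more recent fetch time and a shorter URL are preferred, and that "httpsOverHttp" only applies when two URLs are identical except for the protocol.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the Nutch DeduplicationJob implementation.
class DedupOrderSketch {

  /** Hypothetical stand-in for a CrawlDb entry; not a Nutch class. */
  static final class Entry {
    final String url;
    final float score;
    final long fetchTime;

    Entry(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  /** Drops a leading http:// or https://, leaves other URLs unchanged. */
  static String stripScheme(String url) {
    if (url.startsWith("https://")) return url.substring("https://".length());
    if (url.startsWith("http://")) return url.substring("http://".length());
    return url;
  }

  /** Comparator for a single criterion; the preferred entry compares as smaller. */
  static Comparator<Entry> criterion(String name) {
    switch (name) {
      case "score": // assumption: higher score preferred
        return (a, b) -> Float.compare(b.score, a.score);
      case "fetchTime": // assumption: more recent fetch preferred
        return (a, b) -> Long.compare(b.fetchTime, a.fetchTime);
      case "urlLength": // shorter URL preferred
        return (a, b) -> Integer.compare(a.url.length(), b.url.length());
      case "httpsOverHttp":
        return (a, b) -> {
          // Only decides when the URLs differ in nothing but the protocol.
          if (!stripScheme(a.url).equals(stripScheme(b.url))) return 0;
          return Boolean.compare(b.url.startsWith("https://"),
                                 a.url.startsWith("https://"));
        };
      default:
        throw new IllegalArgumentException("Unknown criterion: " + name);
    }
  }

  /** Chains the criteria in the order given, e.g. "httpsOverHttp,score". */
  static Comparator<Entry> fromCompareOrder(String compareOrder) {
    Comparator<Entry> chain = (a, b) -> 0;
    for (String name : compareOrder.split(",")) {
      chain = chain.thenComparing(criterion(name.trim()));
    }
    return chain;
  }

  /** Returns the entry to keep; all others would be marked as duplicates. */
  static Entry best(List<Entry> duplicates, String compareOrder) {
    Comparator<Entry> cmp = fromCompareOrder(compareOrder);
    Entry best = duplicates.get(0);
    for (Entry e : duplicates) {
      if (cmp.compare(e, best) < 0) best = e;
    }
    return best;
  }

  public static void main(String[] args) {
    // Same score for both, as in the test above; fetch times are placeholders.
    List<Entry> dups = Arrays.asList(
        new Entry("http://nutch.apache.org/", 1.18f, 2L),
        new Entry("https://nutch.apache.org/", 1.18f, 1L));
    System.out.println(best(dups, "httpsOverHttp,score,urlLength,fetchTime").url);
    System.out.println(best(dups, "score,urlLength,fetchTime").url);
  }
}
```

Under these assumptions the first call keeps the https:// URL, while the second falls through to "urlLength" because the scores are equal and keeps the http:// URL, which is one character shorter.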
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> DeduplicationJob: add option to prefer https:// over http://
> ------------------------------------------------------------
>
>                 Key: NUTCH-2683
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2683
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> The deduplication job can keep the shortest URL as the "best" URL of a 
> set of duplicates, marking all longer ones as duplicates. Recently, search 
> engines started to penalize non-https pages by [giving https pages a higher 
> rank|https://webmasters.googleblog.com/2014/08/https-as-ranking-signal.html] 
> and [marking http as 
> insecure|https://www.blog.google/products/chrome/milestone-chrome-security-marking-http-not-secure/].
> If URLs are identical except for the protocol, the deduplication job should be 
> able to prefer https:// over http:// URLs, although the latter are 
> shorter by one character. Of course, this should be configurable and work in 
> addition to the existing preferences (length, score and fetch time) for 
> selecting the "best" URL among duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
