[ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163003#comment-15163003
 ] 

Markus Jelsma commented on NUTCH-2232:
--------------------------------------

Yes, there is clearly a difference in length between 
{{https://zh.wikipedia.org/wiki/馬伯利訴麥迪遜案}} and 
{{https://zh.wikipedia.org/wiki/%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88}}.
This could, in some cases, lead to unexpected behaviour when URLs are compared by length.
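
To illustrate the size of the difference, here is a minimal sketch that compares the raw length of the encoded URL above with its length after percent-decoding. The class {{UrlLengthDemo}} and the helper {{decodedLength}} are hypothetical names used only for this example. They are not part of Nutch and do not reflect what the attached patch does. The sketch assumes UTF-8 encoding and uses {{java.net.URLDecoder}}, which also turns {{+}} into a space and throws {{IllegalArgumentException}} on malformed escape sequences, so real code might need a more careful decoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Nutch.
public class UrlLengthDemo {

    // Length of the URL after percent-decoding it as UTF-8.
    // Assumption: the URL contains only well-formed %XX escapes.
    static int decodedLength(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8").length();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported by the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "https://zh.wikipedia.org/wiki/"
            + "%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89"
            + "%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88";

        // Each Chinese character takes 9 characters when percent-encoded
        // (3 UTF-8 bytes, 3 characters per byte) but only 1 once decoded.
        System.out.println("encoded length: " + encoded.length());
        System.out.println("decoded length: " + decodedLength(encoded));
    }
}
```

A length comparison done on the decoded form would treat this URL as much shorter than the same comparison done on the stored, encoded form.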

> DeduplicationJob: Url is not decoded before the url length is compared.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. 
> When two URLs are compared by length, they are not decoded first. This can 
> produce a misleading URL length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
