Ron van der Vegt created NUTCH-2232:
---------------------------------------
Summary: DeduplicationJob: Url is not decoded before the url length is compared.
Key: NUTCH-2232
URL: https://issues.apache.org/jira/browse/NUTCH-2232
Project: Nutch
Issue Type: Bug
Components: crawldb
Reporter: Ron van der Vegt
When several documents have the same signature, the deduplication job will
mark one as a duplicate. The URLs are stored URL-encoded in the crawldb. When
two URLs are compared by length, they are not decoded first. Because each
percent-escape such as %20 takes three characters to represent one, the encoded
length can be misleading.
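
The effect can be illustrated with a minimal standalone sketch. This is not the
DeduplicationJob code: the class UrlLengthComparison, its helper methods, and
the example.com URLs are hypothetical, and UTF-8 is assumed as the encoding of
the escaped characters. It only shows that the order of two URLs by length can
flip once they are decoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration only; not taken from the Nutch code base.
public class UrlLengthComparison {

    // Decode percent-escapes, assuming UTF-8. Falls back to the raw URL
    // if the input contains a malformed escape sequence.
    static String decode(String url) {
        try {
            return URLDecoder.decode(url, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return url;
        }
    }

    // Length of the URL after decoding.
    static int decodedLength(String url) {
        return decode(url).length();
    }

    // True if 'a' is shorter than 'b' when both are compared as stored (encoded).
    static boolean shorterEncoded(String a, String b) {
        return a.length() < b.length();
    }

    // True if 'a' is shorter than 'b' when both are decoded first.
    static boolean shorterDecoded(String a, String b) {
        return decodedLength(a) < decodedLength(b);
    }

    public static void main(String[] args) {
        String a = "http://example.com/a%20b"; // contains one escaped space
        String b = "http://example.com/a-bc";  // contains no escapes

        System.out.println("encoded lengths: " + a.length() + " vs " + b.length());
        System.out.println("decoded lengths: " + decodedLength(a) + " vs " + decodedLength(b));
        System.out.println("a shorter (encoded)? " + shorterEncoded(a, b));
        System.out.println("a shorter (decoded)? " + shorterDecoded(a, b));
    }
}
```

Note that URLDecoder follows form-encoding rules and also turns '+' into a
space; that does not change the length, but a real fix may prefer a different
decoder.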
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)