[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2232.
----------------------------------
    Resolution: Fixed
      Assignee: Markus Jelsma

Committed to trunk in revision 1732160.
Thanks Ron van der Vegt

> DeduplicationJob should decode URL's before length is compared
> --------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Ron van der Vegt
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2232.patch, NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication script will
> elect one as duplicate. The URLs are stored URL-encoded in the crawldb. When
> two URLs are compared by length, they are not decoded first, so the compared
> lengths can be misleading.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
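
To illustrate the problem described in the issue, here is a minimal sketch of comparing two URLs by decoded length rather than raw length. It is not the committed patch: the class name DecodedLengthDemo and the methods decodedLength and pickDuplicate are hypothetical, and the sketch assumes that the longer URL is the one marked as duplicate and that UTF-8 form-style decoding via java.net.URLDecoder is close enough to how the URLs were encoded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodedLengthDemo {

  // Length of the URL after percent-decoding. Falls back to the raw
  // length when the URL contains a malformed escape sequence.
  static int decodedLength(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8").length();
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return url.length();
    }
  }

  // Assumption: of two URLs with the same signature, the one that is
  // longer after decoding is marked as the duplicate. On a tie, b is returned.
  static String pickDuplicate(String a, String b) {
    return decodedLength(a) > decodedLength(b) ? a : b;
  }

  public static void main(String[] args) {
    String encoded = "http://example.com/%C3%A9%C3%A9"; // two encoded non-ASCII characters
    String plain = "http://example.com/abcdef";

    System.out.println("raw lengths:     " + encoded.length() + " vs " + plain.length());
    System.out.println("decoded lengths: " + decodedLength(encoded) + " vs " + decodedLength(plain));
    System.out.println("duplicate:       " + pickDuplicate(encoded, plain));
  }
}
```

In this example the encoded URL is longer than the plain one when raw lengths are compared, but shorter once decoded, so the choice of duplicate flips depending on whether decoding happens first.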