[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286780#comment-15286780 ]
Sebastian Nagel commented on NUTCH-2237:
----------------------------------------

Hi [~ronvandervegt],

sorry for the late reply and thanks for the updated patch.

Some points that may be worth considering:
# caching of slug counts: a global cache is dangerous because it may consume a huge amount of memory if the CrawlDb to be deduplicated contains many candidates for deduplication. A global cache is also unnecessary because all URLs with the same signature are passed to the same reduce() call. It is enough to cache the slug count of the best URL so far (the shortest in terms of meaningful characters).
# same for decoded URLs: cache the decoded URL of existingDoc
# "Is it safe to first decode the URL string and then parse the resulting string as URL?"
* "when there are invalid characters and the decoding fails, the result of countUrlSlugCharacters will be zero in which case it will always lose while comparing". But the shortest URL is kept, so in case of doubt a count of zero would cause the invalid URL to survive.
* for very special URLs the character count is wrong, e.g., "?" is dropped if the URL is decoded too early:
{code}
Assert.assertEquals(7, URLUtil.countUrlSlugCharacters(URLDecoder.decode("https://de.wikipedia.org/wiki/%3f%3f%3f", "UTF-8")));
{code}

And some trivialities...
* order of arguments: use assertEquals(<expected>, <actual>) to get proper error messages: {{java.lang.AssertionError: expected:<5> but was:<4>}}
* please do not use {{e.printStackTrace()}} but {{LOG.error("..." + StringUtils.stringifyException(e));}}. Alternatively, ignore the exception.

> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the same, based on
> score, URL length and fetch time.
> The quality of the slug, based mainly on the number of meaningful
> characters, could give users more flexibility to distinguish between
> slugified URLs and URLs based on a page ID.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
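
To illustrate the decoding-order point from the comment, here is a minimal, self-contained sketch using only java.net classes. It does not use the patch's countUrlSlugCharacters; the class DecodeOrderDemo and its two helper methods are hypothetical names made up for this example. It only shows how the path of the Wikipedia URL from the comment differs depending on whether the string is decoded before or after it is parsed.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLDecoder;

// Hypothetical demo class, not part of Nutch or of the patch.
class DecodeOrderDemo {

  /** Decodes the whole URL string first, then parses the result. */
  static String pathDecodeFirst(String url) {
    try {
      // "%3f" becomes "?", which the URL parser then treats as the
      // start of the query, so it is no longer part of the path.
      return new URL(URLDecoder.decode(url, "UTF-8")).getPath();
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }

  /** Parses the raw URL first, then decodes only its path. */
  static String pathParseFirst(String url) {
    try {
      // The escaped "?" stays inside the path until after parsing.
      return URLDecoder.decode(new URL(url).getPath(), "UTF-8");
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String url = "https://de.wikipedia.org/wiki/%3f%3f%3f";
    System.out.println(pathDecodeFirst(url)); // /wiki/
    System.out.println(pathParseFirst(url));  // /wiki/???
  }
}
```

Note that URLDecoder implements form decoding (it also turns "+" into a space), so it may not be the right tool for every path. The sketch is only meant to show why the order of decoding and parsing matters.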