[
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286780#comment-15286780
]
Sebastian Nagel commented on NUTCH-2237:
----------------------------------------
Hi [~ronvandervegt], sorry for the late reply and thanks for the updated patch.
Some points that may be worth considering:
# caching of slug counts: a global cache is dangerous because it may consume
a huge amount of memory if the CrawlDb to be deduplicated contains many
candidates for deduplication. Also, a global cache is not necessary because all
URLs with the same signature are passed to the reduce() function. It's enough
to cache the slug count of the best URL (shortest in terms of meaningful
characters).
# same for decoded URLs: cache the decoded URL of existingDoc
# "Is it safe to first decode the URL string and then parse the resulting
string as URL?"
* "when there are invalid characters and the decoding fails, the result of
countUrlSlugCharacters will be zero in which case will always lose while
comparing" -- but the shortest URL is kept, so in case of doubt a count of zero
will cause the invalid one to survive.
* for very special URLs the character count is wrong, e.g., "?" is dropped if
decoded too early:
{code}
Assert.assertEquals(7,
    URLUtil.countUrlSlugCharacters(
        URLDecoder.decode("https://de.wikipedia.org/wiki/%3f%3f%3f", "UTF-8")));
{code}
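To make the "decoded too early" case concrete, here is a minimal, self-contained sketch. It does not use the patch's URLUtil.countUrlSlugCharacters; the class DecodeOrderDemo and its two helper methods are hypothetical names, and java.net.URI is used as a stand-in parser. It only shows what is left of the path when the string is decoded before parsing versus parsed before decoding.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeOrderDemo {

  /** Decode the whole URL string first, then parse it and take the path. */
  static String pathDecodeFirst(String url) {
    String decoded = URLDecoder.decode(url, StandardCharsets.UTF_8);
    // The decoded "?" characters are now read as the start of the query part.
    return URI.create(decoded).getPath();
  }

  /** Parse the URL first, then decode only the raw (still escaped) path. */
  static String pathParseFirst(String url) {
    String rawPath = URI.create(url).getRawPath();
    // Note: URLDecoder also turns "+" into a space, which may not be wanted for paths.
    return URLDecoder.decode(rawPath, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String url = "https://de.wikipedia.org/wiki/%3f%3f%3f";
    System.out.println(pathDecodeFirst(url)); // prints "/wiki/"
    System.out.println(pathParseFirst(url));  // prints "/wiki/???"
  }
}
```

With decode-first, the three question marks end up outside the path, so any count based on the path misses them.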
And some trivialities...
* order of arguments: assertEquals(<expected>, <actual>) to get proper error
messages: {{java.lang.AssertionError: expected:<5> but was:<4>}}
* please do not use {{e.printStackTrace()}} but
{{LOG.error("..." + StringUtils.stringifyException(e));}}. Alternatively, ignore the exception.
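Regarding the caching points (1 and 2) above, the idea can be sketched roughly as follows. This is not the DeduplicationJob code: SlugCacheSketch, countSlugChars and pickBest are hypothetical names, the counter simply counts letters and digits as an assumed stand-in for the patch's slug counter, and a plain list stands in for the values passed to one reduce() call. The point is only that the count of the current best URL lives in a local variable, so nothing is cached globally and each URL is counted once.

```java
import java.util.Arrays;
import java.util.List;

public class SlugCacheSketch {

  /** Number of times the (hypothetical) slug counter was invoked. */
  static int calls = 0;

  /** Assumed stand-in for the slug counter: counts letters and digits. */
  static int countSlugChars(String url) {
    calls++;
    int n = 0;
    for (int i = 0; i < url.length(); i++) {
      if (Character.isLetterOrDigit(url.charAt(i))) {
        n++;
      }
    }
    return n;
  }

  /** Mimics one reduce() call: all URLs share a signature, one is kept. */
  static String pickBest(List<String> sameSignature) {
    String best = null;
    int bestCount = 0; // cached count of the current best URL only
    for (String url : sameSignature) {
      int count = countSlugChars(url); // each candidate is counted once
      if (best == null || count < bestCount) {
        best = url;
        bestCount = count; // replace the cached value together with the best URL
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://example.com/a-long-descriptive-slug",
        "http://example.com/p1",
        "http://example.com/page-two");
    System.out.println(pickBest(urls)); // prints "http://example.com/p1"
    System.out.println(calls);          // prints "3"
  }
}
```

The cached value goes out of scope when the method returns, so memory use does not depend on the size of the CrawlDb. The same pattern would apply to the decoded URL of existingDoc.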
> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
> Key: NUTCH-2237
> URL: https://issues.apache.org/jira/browse/NUTCH-2237
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the same,
> based on score, URL length and fetch time. The quality of the slug, based
> mainly on the number of meaningful characters, could give users more
> flexibility to distinguish between slugified URLs and URLs based on page ID.