[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

Sebastian Nagel (JIRA) Tue, 17 May 2016 08:00:24 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286780#comment-15286780
 ]


Sebastian Nagel commented on NUTCH-2237:
----------------------------------------

Hi [~ronvandervegt], sorry for the late reply and thanks for the updated patch. 
Some points that may be worth considering:
# caching of slug counts: a global cache is dangerous because it may consume a 
huge amount of memory if the CrawlDb to be deduplicated contains many 
candidates for deduplication. Also, a global cache is not necessary because all 
URLs with the same signature are passed to the same reduce() call. It's enough 
to cache the slug count of the best URL seen so far (shortest in terms of 
meaningful characters).
# same for decoded URLs: cache the decoded URL of existingDoc
# "Is it safe to first decode the URL string and then parse the resulting 
string as URL?"
* "when there are invalid characters and the decoding failes, the result of 
countUrlSlugCharacters will be zero in which case will always lose while 
comparing" -- but when in doubt the shortest URL is kept, so a count of zero 
would cause the invalid URL to survive.
* for very special URLs the character count is wrong, e.g., "?" is dropped if 
the URL is decoded too early:
{code}
 Assert.assertEquals(7,
     URLUtil.countUrlSlugCharacters(
         URLDecoder.decode("https://de.wikipedia.org/wiki/%3f%3f%3f", "UTF-8")));
{code}
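To illustrate the decoding-order point, here is a minimal standalone sketch. It does not use the patch's {{countUrlSlugCharacters}}; the class and method names are made up for illustration, and it uses {{java.net.URI}} only to split off the path. It compares "decode, then parse" with "parse, then decode the path":

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical helper class, not part of Nutch or the patch.
class DecodeOrderSketch {

  private static String decodeUtf8(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is always available on a JVM
      throw new IllegalStateException(e);
    }
  }

  /** Decode the whole URL string first, then parse it and return the path. */
  static String pathDecodedFirst(String rawUrl) {
    String decoded = decodeUtf8(rawUrl);
    // "%3f" is now a literal "?", which the parser takes as the start of the query
    return URI.create(decoded).getPath();
  }

  /** Parse the raw URL first, then decode only the path component. */
  static String pathParsedFirst(String rawUrl) {
    String rawPath = URI.create(rawUrl).getRawPath();
    return decodeUtf8(rawPath);
  }

  public static void main(String[] args) {
    String url = "https://de.wikipedia.org/wiki/%3f%3f%3f";
    System.out.println(pathDecodedFirst(url)); // prints "/wiki/"
    System.out.println(pathParsedFirst(url));  // prints "/wiki/???"
  }
}
```

With early decoding the three question marks end up in the query part and the slug looks empty, so any count based on the path would be off.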
And some trivialities...
* order of arguments: assertEquals(<expected>, <actual>) to get proper error 
messages: {{java.lang.AssertionError: expected:<5> but was:<4>}}
* please, do not use {{e.printStackTrace()}} but {{LOG.error("..." + 
StringUtils.stringifyException(e));}}. Alternatively, ignore the exception.
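
Coming back to the first point, the idea of caching only the count of the current best URL could look roughly like the sketch below. It is a simplification: {{pickBest}} stands in for one reduce() call, {{countMeaningfulChars}} is a hypothetical placeholder for the patch's slug metric (here it just counts letters and digits), and "fewer meaningful characters wins" is an assumption taken from the wording above; flip the comparison if the opposite order is intended.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the DeduplicationJob code.
class BestUrlCacheSketch {

  /** Placeholder metric (assumption): number of letters and digits in the URL. */
  static int countMeaningfulChars(String url) {
    int count = 0;
    for (int i = 0; i < url.length(); i++) {
      if (Character.isLetterOrDigit(url.charAt(i))) {
        count++;
      }
    }
    return count;
  }

  /** Stand-in for one reduce() call: all URLs passed in share the same signature. */
  static String pickBest(List<String> urlsWithSameSignature) {
    String best = null;
    int bestCount = 0; // cached count of 'best', local to this call only
    for (String url : urlsWithSameSignature) {
      int count = countMeaningfulChars(url); // computed once per candidate
      if (best == null || count < bestCount) {
        best = url;
        bestCount = count; // replace the cached value together with the best URL
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://a.example/page-one-two",
        "http://a.example/p1",
        "http://a.example/page-one");
    System.out.println(pickBest(urls)); // prints "http://a.example/p1"
  }
}
```

Only two local variables are kept per signature group, so memory use does not grow with the size of the CrawlDb, and the count of the best URL is never recomputed.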

> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
>
> Currently, users can select the main document among documents with the same 
> signature based on score, URL length and fetch time. The quality of the slug, 
> based mainly on the number of meaningful characters, could give users more 
> flexibility to distinguish between slugified URLs and URLs based on page id.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
