[ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15178587#comment-15178587
 ] 

Sebastian Nagel commented on NUTCH-2237:
----------------------------------------

Good idea! Nice patch, including unit tests. A few comments for possible 
improvements:
* maybe URLUtil.java would be the better place for the slug functions, next to 
chooseRepr(...), which provides similar functionality
* URLs are now always decoded, even if the decision which URL/document to keep 
is made solely by comparing score or fetch time. Since decoding URLs isn't 
a cheap computation,
*# it should be done lazily, and
*# the result could be cached for later comparisons if there are more than 2 
duplicates. Caching would be an improvement over the current state, but it 
should cover both the decoded URL string and the slug length.
* Is it safe to first decode the URL string and then parse the resulting string 
as a URL? After decoding, the string may contain forbidden or reserved 
characters, so that the URL path and query fail to get properly parsed.
* neither branch of this if clause is reachable, because compareUrlSlug(...) 
returns -1, 0, or 1 and so is never > 1:
{code}
if (compareUrlSlug(urlExisting, urlnewDoc) > 1) {
  // mark new one as duplicate
  ...
} else if (compareUrlSlug(urlnewDoc, urlExisting) > 1) {
{code}
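
To illustrate the point about lazy decoding and caching, here is a minimal 
sketch. The class LazySlugInfo and its members are hypothetical and not part 
of the patch, and the slug measure (letters and digits after the last '/') is 
only a placeholder for whatever the patch actually computes. The decoded URL 
and the slug length are each computed on first use and then reused:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical holder: decodes the URL and measures the slug at most once. */
final class LazySlugInfo {
  private final String rawUrl;
  private String decodedUrl;    // null until first requested
  private int slugLength = -1;  // -1 until first requested
  int decodeCount = 0;          // for illustration only: counts real decodes

  LazySlugInfo(String rawUrl) {
    this.rawUrl = rawUrl;
  }

  /** Decodes on first call, returns the cached string afterwards. */
  String getDecodedUrl() {
    if (decodedUrl == null) {
      try {
        decodedUrl = URLDecoder.decode(rawUrl, "UTF-8");
      } catch (UnsupportedEncodingException | IllegalArgumentException e) {
        decodedUrl = rawUrl; // assumption: fall back to the raw URL on failure
      }
      decodeCount++;
    }
    return decodedUrl;
  }

  /** Placeholder slug measure: letters and digits after the last '/'. */
  int getSlugLength() {
    if (slugLength < 0) {
      String url = getDecodedUrl();
      String slug = url.substring(url.lastIndexOf('/') + 1);
      int count = 0;
      for (int i = 0; i < slug.length(); i++) {
        if (Character.isLetterOrDigit(slug.charAt(i))) {
          count++;
        }
      }
      slugLength = count;
    }
    return slugLength;
  }
}
```

If score or fetch time already decides which document to keep, neither getter 
is called and nothing is decoded. With more than 2 duplicates, the surviving 
document's object can be kept and reused for the next comparison.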

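The concern about decoding before parsing can be shown with a small sketch. 
The class SlugDecodeOrder and its methods are hypothetical, and java.net.URI 
is used here only for illustration (the patch may parse differently). Note 
also that URLDecoder implements form decoding and turns '+' into a space, 
which may not be wanted for URL paths.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

/** Hypothetical comparison of the two orders: decode-then-parse vs. parse-then-decode. */
final class SlugDecodeOrder {

  /** Last path segment of an already parsed URI, still percent-encoded. */
  static String rawLastSegment(URI uri) {
    String path = uri.getRawPath();
    if (path == null) {
      return ""; // opaque URIs have no path
    }
    return path.substring(path.lastIndexOf('/') + 1);
  }

  static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException(e); // UTF-8 is always supported
    }
  }

  /** Risky order: an escaped '?' or '#' becomes a real delimiter before parsing. */
  static String slugDecodeFirst(String url) {
    return rawLastSegment(URI.create(decode(url)));
  }

  /** Safer order: split into components first, then decode only the slug. */
  static String slugParseFirst(String url) {
    return decode(rawLastSegment(URI.create(url)));
  }
}
```

For http://example.com/docs/a%3Fb%23c?id=1 the decode-first variant sees 
the path /docs/a and yields the slug "a", because the decoded '?' and '#' 
are taken as the start of query and fragment. The parse-first variant 
yields "a?b#c".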

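For the unreachable if clause, one possible shape of a fix is sketched here. 
All names in it are illustrative: the compareUrlSlug below is only a stand-in 
with an assumed (String, String) signature that compares the length of the 
text after the last '/', and it is assumed (based on the comment in the first 
branch of the snippet above) that a positive result means the first URL has 
the better slug. The comparison is done once and tested against 0:

```java
/** Hypothetical sketch of the corrected branch logic. */
final class SlugOrder {

  enum Duplicate { NEW_DOC, EXISTING_DOC, UNDECIDED }

  /** Placeholder slug measure: number of characters after the last '/'. */
  static int slugLength(String url) {
    return url.length() - (url.lastIndexOf('/') + 1);
  }

  /** Stand-in for compareUrlSlug(...): returns exactly -1, 0, or 1. */
  static int compareUrlSlug(String a, String b) {
    return Integer.signum(Integer.compare(slugLength(a), slugLength(b)));
  }

  static Duplicate chooseDuplicate(String urlExisting, String urlNewDoc) {
    int cmp = compareUrlSlug(urlExisting, urlNewDoc); // compare only once
    if (cmp > 0) {
      return Duplicate.NEW_DOC;      // existing slug is better: mark new one as duplicate
    } else if (cmp < 0) {
      return Duplicate.EXISTING_DOC; // new slug is better: mark existing one as duplicate
    }
    return Duplicate.UNDECIDED;      // equal slugs: fall through to the next criterion
  }
}
```

Calling the comparison once also avoids computing the slugs a second time 
with swapped arguments.
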
> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the 
> same, by score, URL length and fetch time. The quality of the slug, based 
> mainly on the number of meaningful characters, could give users more 
> flexibility to distinguish between slugified URLs and URLs based on page ID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
