[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

Ron van der Vegt (JIRA) Mon, 07 Mar 2016 07:08:38 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183112#comment-15183112
 ]


Ron van der Vegt commented on NUTCH-2237:
-----------------------------------------

Thanks for the feedback!

- maybe URLUtil.java would be the better place for the slug functions, next to 
chooseRepr(...) which provides a similar functionality

I moved the count method to URLUtil, and included extra comments. The compare 
method has been removed.

- URLs are now always decoded, even if the decision which URL/document to keep 
is done solely by comparison of score or fetch time. Since decoding URLs isn't 
a cheap computation
    1. it should be done lazily, and
    2. the result could be cached for later comparisons if there are more than 
2 duplicates. This would be an improvement of the current state, but should be 
done for both the decoded URL string and the slug length.

Decoding is lazy now and caching for both decoding and slug length has been 
added.
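
As a rough illustration of the lazy decoding and caching described above (a 
sketch only, not the code in the attached patch): the class CachedUrl and its 
fields are hypothetical, and the length of the last path segment is used here 
merely as a stand-in for the real slug count.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical holder for one duplicate candidate; not a Nutch class. */
final class CachedUrl {
  private final String raw;
  private String decoded;       // filled on the first call to decoded()
  private boolean decodeTried;  // true once decoding has been attempted
  private int slugLength = -1;  // -1 means "not computed yet"

  CachedUrl(String raw) {
    this.raw = raw;
  }

  /** Decodes the URL on first use only; returns null if decoding fails. */
  String decoded() {
    if (!decodeTried) {
      decodeTried = true;
      try {
        decoded = URLDecoder.decode(raw, "UTF-8");
      } catch (UnsupportedEncodingException | IllegalArgumentException e) {
        decoded = null; // malformed escape sequence or unknown charset
      }
    }
    return decoded;
  }

  /** Placeholder slug measure (assumption): length of the last path segment. */
  int slugLength() {
    if (slugLength < 0) {
      String d = decoded();
      slugLength = (d == null) ? 0 : d.length() - d.lastIndexOf('/') - 1;
    }
    return slugLength;
  }
}
```

With this shape, a comparison that is already decided by score or fetch time 
never calls decoded(), and a URL that takes part in several comparisons is 
decoded at most once.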

- Is it safe to first decode the URL string and then parse the resulting string 
as URL? After decoding there may be forbidden or 
reserved characters, so that the URL path and query fail to get properly parsed.

Yes, when there are invalid characters and the decoding fails, the result of 
countUrlSlugCharacters will be zero, in which case the URL will always lose 
the comparison.
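
A minimal sketch of that failure handling, assuming a decode-then-parse 
approach with java.net.URLDecoder and java.net.URL. The class SlugCount, the 
method countSlugCharacters and the "count letters in the last path segment" 
rule are hypothetical stand-ins, not the patch's countUrlSlugCharacters.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;

/** Hypothetical helper; the counting rule is an assumption for illustration. */
final class SlugCount {

  /** Returns the number of letters in the last path segment, or 0 on failure. */
  static int countSlugCharacters(String rawUrl) {
    try {
      String decoded = URLDecoder.decode(rawUrl, "UTF-8");
      String path = new URL(decoded).getPath();
      String last = path.substring(path.lastIndexOf('/') + 1);
      int letters = 0;
      for (int i = 0; i < last.length(); i++) {
        if (Character.isLetter(last.charAt(i))) {
          letters++;
        }
      }
      return letters;
    } catch (UnsupportedEncodingException | MalformedURLException
        | IllegalArgumentException e) {
      // Decoding or parsing failed: report zero so this URL loses the comparison.
      return 0;
    }
  }
}
```

In this sketch only outright failures map to zero. A decoded reserved 
character such as "?" raises no exception; it just ends the parsed path 
earlier, so "what%3Fnow" is counted as "what".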

- no branch of this if clause is reachable given that compareUrlSlug(...) 
returns -1, 0, or 1:

This is done on purpose. When the result is zero, meaning the two are equal, 
it should fall through to the next comparison criterion. The other criteria 
work the same way (score, fetch time, etc.).
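
The fall-through can be sketched as a chain of comparisons in which each zero 
result hands over to the next criterion. The class DuplicateOrder, the Doc 
fields, the order of the criteria and the preferred direction of each one 
(higher score, larger slug count, later fetch time, shorter URL) are 
assumptions made for this example, not taken from the patch.

```java
/** Hypothetical illustration of chained ordering criteria; not Nutch code. */
final class DuplicateOrder {

  /** Hypothetical record of the values used to pick the document to keep. */
  static final class Doc {
    final String url;
    final float score;
    final long fetchTime;
    final int slugCount;

    Doc(String url, float score, long fetchTime, int slugCount) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
      this.slugCount = slugCount;
    }
  }

  /** Returns the document to keep; a tie (0) moves on to the next criterion. */
  static Doc choose(Doc a, Doc b) {
    int c = Float.compare(a.score, b.score);
    if (c != 0) {
      return c > 0 ? a : b;
    }
    c = Integer.compare(a.slugCount, b.slugCount);
    if (c != 0) {
      return c > 0 ? a : b;
    }
    c = Long.compare(a.fetchTime, b.fetchTime);
    if (c != 0) {
      return c > 0 ? a : b;
    }
    // Last resort: keep the shorter URL (the first one if both are equal).
    return a.url.length() <= b.url.length() ? a : b;
  }
}
```

A URL whose slug count is zero, for example because decoding failed, loses at 
the slug step only if the scores are equal; a higher score still wins first.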

Please let me know what you think of these fixes.


> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the 
> same, based on score, URL length and fetch time. The quality of the slug, 
> based mainly on the number of meaningful characters, could give users more 
> flexibility to distinguish between slugified URLs and URLs based on page ID.



