[
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183112#comment-15183112
]
Ron van der Vegt commented on NUTCH-2237:
-----------------------------------------
Thanks for the feedback!
- maybe URLUtil.java would be the better place for the slug functions, next to
chooseRepr(...) which provides a similar functionality
I moved the count method to URLUtil, and included extra comments. The compare
method has been removed.
- URLs are now always decoded, even if the decision which URL/document to keep
is done solely by comparison of score or fetch time. Since decoding URLs isn't
a cheap computation
1. it should be done lazily, and
2. the result could be cached for later comparisons if there are more than
2 duplicates. This would be an improvement of the current state, but should be
done for both the decoded URL string and the slug length.
Decoding is lazy now and caching for both decoding and slug length has been
added.
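A minimal sketch of the lazy decoding and caching idea, not the code from the
patch. The class name LazyUrlInfo, its fields, and the decodeCalls counter are
hypothetical and exist only for illustration. It assumes java.net.URLDecoder
is used for decoding (which also turns '+' into a space) and that a failed
decode is cached as null so it is not retried.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical holder for one duplicate candidate; not the class from the patch. */
class LazyUrlInfo {
  private final String rawUrl;
  private String decoded;       // cached result, null if decoding failed
  private boolean decodeTried;  // true once decoding has been attempted
  int decodeCalls = 0;          // illustration only: counts real decode attempts

  LazyUrlInfo(String rawUrl) {
    this.rawUrl = rawUrl;
  }

  /** Decodes on first use only; later calls return the cached value. */
  String getDecodedUrl() {
    if (!decodeTried) {
      decodeTried = true;
      decodeCalls++;
      try {
        decoded = URLDecoder.decode(rawUrl, "UTF-8");
      } catch (UnsupportedEncodingException | IllegalArgumentException e) {
        // Malformed percent escapes raise IllegalArgumentException.
        decoded = null;
      }
    }
    return decoded;
  }
}
```

If score or fetch time already decides between two candidates,
getDecodedUrl() is never called and nothing is decoded. The slug length could
be cached in the same way, in a second field next to the decoded string.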
- Is it safe to first decode the URL string and then parse the resulting string
as URL? After decoding there may be forbidden or
reserved characters, so that the URL path and query fail to get properly parsed.
Yes. When there are invalid characters and the decoding fails, the result of
countUrlSlugCharacters will be zero, in which case the URL will always lose
the comparison.
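A rough sketch of this fallback, under stated assumptions. The class
SlugCountSketch and its method are hypothetical stand-ins, and the counting
rule used here (letters in the URL path) is only a guess at what "meaningful
characters" means; the patch may count differently. The point is that any
failure during decoding or parsing yields zero.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLDecoder;

/** Hypothetical stand-in for the slug counting in the patch. */
class SlugCountSketch {

  /** Returns the number of letters in the decoded URL path, or 0 on any failure. */
  static int countSlugCharacters(String rawUrl) {
    try {
      String decoded = URLDecoder.decode(rawUrl, "UTF-8");
      String path = new URL(decoded).getPath();
      int count = 0;
      for (int i = 0; i < path.length(); i++) {
        // Assumption: only letters count as "meaningful" characters.
        if (Character.isLetter(path.charAt(i))) {
          count++;
        }
      }
      return count;
    } catch (UnsupportedEncodingException | IllegalArgumentException
        | MalformedURLException e) {
      // Decoding or parsing failed: score the URL as having no slug.
      return 0;
    }
  }
}
```

With this rule a slugified URL scores higher than an id-based one, and a URL
that cannot be decoded or parsed scores zero.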
- no branch of this if clause is reachable given that compareUrlSlug(...)
returns -1, 0, or 1:
This is done on purpose. When the result is zero, meaning the slugs are equal,
it should fall through to the next comparison criterion. The other criteria
work the same way (score, fetch time, etc.).
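The fall-through can be sketched as a chain of comparisons. This is an
illustration only: the Candidate class, its fields, and the order of the
criteria (score, then slug, then fetch time) are assumptions and not taken
from the patch.

```java
/** Hypothetical duplicate candidate with the fields used for ordering. */
class Candidate {
  final float score;
  final int slugChars;
  final long fetchTime;

  Candidate(float score, int slugChars, long fetchTime) {
    this.score = score;
    this.slugChars = slugChars;
    this.fetchTime = fetchTime;
  }
}

class DedupOrderSketch {

  /**
   * Returns a positive value if a should be kept over b, a negative value
   * if b should be kept, and 0 only if every criterion ties.
   */
  static int compare(Candidate a, Candidate b) {
    int c = Float.compare(a.score, b.score);
    if (c != 0) {
      return c; // score decides
    }
    c = Integer.compare(a.slugChars, b.slugChars);
    if (c != 0) {
      return c; // slug decides
    }
    // Slugs are equal: fall through to the next criterion.
    return Long.compare(a.fetchTime, b.fetchTime);
  }
}
```

Each step returns early only when it finds a difference, so a result of zero
from one criterion hands the decision to the next one.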
Please let me know what you think of these fixes.
> DeduplicationJob: Add extra order criteria based on slug
> --------------------------------------------------------
>
> Key: NUTCH-2237
> URL: https://issues.apache.org/jira/browse/NUTCH-2237
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2237.patch
>
>
> Currently the user can choose the main document, when signatures are the
> same, based on score, URL length and fetch time. The quality of the slug,
> based mainly on the number of meaningful characters, could give users more
> flexibility to distinguish between slugified URLs and URLs based on page ID.