[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug
[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938767#comment-16938767 ]

Sebastian Nagel commented on NUTCH-2237:

Moving to 1.17, open points should be addressed.

> DeduplicationJob: Add extra order criteria based on slug
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2237.patch, NUTCH-2237.patch
>
> Currently, when signatures are the same, the user can elect the main
> document based on score, URL length and fetch time. The quality of the
> slug, based mainly on the number of meaningful characters, could give
> users more flexibility to distinguish between slugified URLs and URLs
> based on a page ID.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug
[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183112#comment-15183112 ]

Ron van der Vegt commented on NUTCH-2237:

Thanks for the feedback!

- maybe URLUtil.java would be the better place for the slug functions, next to chooseRepr(...) which provides a similar functionality

I moved the count method to URLUtil and added extra comments. The compare method has been removed.

- URLs are now always decoded, even if the decision which URL/document to keep is done solely by comparison of score or fetch time. Since decoding URLs isn't a cheap computation
  1. it should be done lazily, and
  2. the result could be cached for later comparisons if there are more than 2 duplicates.
  This would be an improvement of the current state, but should be done for both the decoded URL string and the slug length.

Decoding is now lazy, and caching has been added for both the decoded URL and the slug length.

- Is it safe to first decode the URL string and then parse the resulting string as URL? After decoding there may be forbidden or reserved characters, so that the URL path and query fail to get properly parsed.

Yes. When there are invalid characters and decoding fails, the result of countUrlSlugCharacters will be zero, in which case that URL will always lose the comparison.

- no branch of this if clause is reachable given that compareUrlSlug(...) returns -1, 0, or 1:

This is done on purpose. When the result is zero, meaning the slugs are equal, the next comparison criterion should be used. The other criteria (score, fetch time, etc.) work the same way.

Please let me know what you think of these fixes.

> DeduplicationJob: Add extra order criteria based on slug
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch
>
> Currently, when signatures are the same, the user can elect the main
> document based on score, URL length and fetch time. The quality of the
> slug, based mainly on the number of meaningful characters, could give
> users more flexibility to distinguish between slugified URLs and URLs
> based on a page ID.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
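
The lazy decoding and caching described in the comment above could look roughly like the following sketch. It is not the code from NUTCH-2237.patch: the class name SlugCountCache, the method names, and the scoring rule (letters in the last path segment) are assumptions made for illustration. The sketch parses the URL first and lets java.net.URI decode the path afterwards, and it returns zero when parsing fails, which matches the behaviour the comment describes for countUrlSlugCharacters.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a lazily computed, cached slug count. */
class SlugCountCache {
  // Keyed by the raw URL string so each URL is parsed and decoded at most once.
  private final Map<String, Integer> counts = new HashMap<>();
  // Exposed only to make the effect of caching and laziness observable.
  int computations = 0;

  /** Returns the slug count, computing it only on the first request per URL. */
  int slugCount(String url) {
    Integer cached = counts.get(url);
    if (cached == null) {
      cached = countSlugLetters(url);
      counts.put(url, cached);
    }
    return cached;
  }

  /** Compares by score first; slug counts are requested only on a score tie. */
  int compare(String urlA, float scoreA, String urlB, float scoreB) {
    int c = Float.compare(scoreA, scoreB);
    return c != 0 ? c : Integer.compare(slugCount(urlA), slugCount(urlB));
  }

  /** Assumed rule: letters in the decoded last path segment, 0 if unparsable. */
  private int countSlugLetters(String url) {
    computations++;
    String path;
    try {
      path = new URI(url).getPath(); // getPath() returns the decoded path
    } catch (URISyntaxException e) {
      return 0; // malformed URL or escape: this URL loses any slug comparison
    }
    if (path == null) {
      return 0; // opaque URIs have no path
    }
    int end = path.length();
    while (end > 0 && path.charAt(end - 1) == '/') {
      end--; // ignore trailing slashes
    }
    int start = path.lastIndexOf('/', end - 1) + 1;
    int letters = 0;
    for (int i = start; i < end; i++) {
      if (Character.isLetter(path.charAt(i))) {
        letters++;
      }
    }
    return letters;
  }

  public static void main(String[] args) {
    SlugCountCache cache = new SlugCountCache();
    System.out.println(cache.slugCount("http://example.com/blog/my-first-post"));
    System.out.println(cache.slugCount("http://example.com/node/12345"));
  }
}
```

Because compare(...) consults the cache only when the scores are equal, a pair of duplicates that is decided by score never triggers any URL parsing.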
[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug
[ https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178587#comment-15178587 ]

Sebastian Nagel commented on NUTCH-2237:

Good idea! Nice patch, including unit tests. A few comments for possible improvements:
* maybe URLUtil.java would be the better place for the slug functions, next to chooseRepr(...) which provides a similar functionality
* URLs are now always decoded, even if the decision which URL/document to keep is done solely by comparison of score or fetch time. Since decoding URLs isn't a cheap computation
*# it should be done lazily, and
*# the result could be cached for later comparisons if there are more than 2 duplicates.
This would be an improvement of the current state, but should be done for both the decoded URL string and the slug length.
* Is it safe to first decode the URL string and then parse the resulting string as URL? After decoding there may be forbidden or reserved characters, so that the URL path and query fail to get properly parsed.
* no branch of this if clause is reachable given that compareUrlSlug(...) returns -1, 0, or 1:
{code}
if (compareUrlSlug(urlExisting, urlnewDoc) > 1) {
  // mark new one as duplicate
  ...
} else if (compareUrlSlug(urlnewDoc, urlExisting) > 1) {
{code}

> DeduplicationJob: Add extra order criteria based on slug
>
>                 Key: NUTCH-2237
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2237
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2237.patch
>
> Currently, when signatures are the same, the user can elect the main
> document based on score, URL length and fetch time. The quality of the
> slug, based mainly on the number of meaningful characters, could give
> users more flexibility to distinguish between slugified URLs and URLs
> based on a page ID.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
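
The last review point can be illustrated with a small sketch. A comparison that returns only -1, 0, or 1 is never greater than 1, so a test against 0 is needed for either branch to be taken, and a result of 0 can fall through to the next criterion. This is not the patch code: the class SlugTieBreakSketch, the helper names, the order of criteria (score, then slug, then URL length), and keeping the existing document on a full tie are all assumptions chosen for illustration.

```java
/** Hypothetical sketch of a tie-break chain; not the DeduplicationJob code. */
class SlugTieBreakSketch {

  /** Stand-in for compareUrlSlug(...): returns exactly -1, 0, or 1. */
  static int compareSlug(int slugCountA, int slugCountB) {
    return Integer.signum(Integer.compare(slugCountA, slugCountB));
  }

  /**
   * Returns true if the existing document should be kept and the new one
   * marked as duplicate. Each criterion is consulted only when all earlier
   * ones returned 0 (a tie).
   */
  static boolean keepExisting(float scoreExisting, int slugExisting, int urlLenExisting,
                              float scoreNew, int slugNew, int urlLenNew) {
    int c = Float.compare(scoreExisting, scoreNew);       // higher score wins
    if (c == 0) {
      c = compareSlug(slugExisting, slugNew);             // higher slug count wins
    }
    if (c == 0) {
      c = Integer.compare(urlLenNew, urlLenExisting);     // shorter URL wins
    }
    return c >= 0; // assumed: the existing document is kept on a full tie
  }

  public static void main(String[] args) {
    // The slug comparison can reach 1 but never exceed it.
    System.out.println(compareSlug(11, 0) > 1); // false: "> 1" is unreachable
    System.out.println(compareSlug(11, 0) > 0); // true
    // Same score, so the slug count decides in favour of the existing document.
    System.out.println(keepExisting(1.0f, 11, 32, 1.0f, 0, 25));
  }
}
```

Written this way, a slug comparison of 0 moves on to the next criterion without a separate branch, which is the fall-through behaviour the reply to this review describes.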