On 22.11.2012 17:01, Julian Foad wrote:
>> Author: brane
>> Modified: subversion/trunk/subversion/include/private/svn_string_private.h
>> +/**
>> + * Computes the similarity score STRA and STRB, given as the ratio
>> + * between the length of their longest common subsequence and the
>> + * length of the strings, normalized to the range [0..1000].
> "longest common non-contiguous subsequence"
Yes, we know it is, but that's not what the name of the algorithm is.
> "average length of the two strings"
This is true; however, saying "average" always sounds, to me, as if
we're dealing with a whole bunch of values, rather than just two.
> Just use 'float' and [0..1]? I'm thinking we may want to use this for
> comparing diff hunks, and there may be more than 500 characters in each hunk
> and it matters whether just one character has been changed.
Could do that, too. Harder to test though. :) It's trivial to extend the
range to [0..10**6] or even 10**9 if that's a problem. Frankly it feels
nice to keep the whole thing integral and not leave floating-point
messes around.
> [...]
>
>> Modified: subversion/trunk/subversion/tests/libsvn_subr/string-test.c
>> +static svn_error_t *
>> +test_string_similarity(apr_pool_t *pool)
>> +{
>> + const struct sim_score_test_t
>> + {
>> + const char *stra;
>> + const char *strb;
>> + apr_size_t lcs;
>> + int score;
>> + } tests[] =
>> + {
>> +#define SCORE(lcs, len) ((2000 * (lcs) + (len)/2) / (len))
>> +
>> + /* Equality */
>> + {"", "", 0, 1000},
>> + {"quoth", "quoth", 5, SCORE(5, 5+5)},
>> +
>> + /* Deletion at start */
>> + {"quoth", "uoth", 4, SCORE(4, 5+4)},
>> + {"uoth", "quoth", 4, SCORE(4, 4+5)},
>> +
>> + /* Deletion at end */
>> + {"quoth", "quot", 4, SCORE(4, 5+4)},
>> + {"quot", "quoth", 4, SCORE(4, 4+5)},
>> +
>> + /* Insertion at start */
>> + {"quoth", "Xquoth", 5, SCORE(5, 5+6)},
>> + {"Xquoth", "quoth", 5, SCORE(5, 6+5)},
>> +
>> + /* Insertion at end */
>> + {"quoth", "quothX", 5, SCORE(5, 5+6)},
>> + {"quothX", "quoth", 5, SCORE(5, 6+5)},
> Half of the examples in each of these four pairs are redundant.
> (The word "deletion" implies an ordering of string-A -> string-B, so
> by that terminology "uoth" -> "quoth" is an Insertion, same as
> "quoth" -> "Xquoth".)
Theoretically and academically speaking, yes, they're redundant.
Practically, however, this helped me find a bug that only popped up when
exactly one of these examples was removed. I love symmetric algorithms
but my bugs are always asymmetric and some of them break causality, too. :)
So in order to avoid falling into the incompleteness trap, I just
mirrored all the test cases. It's less of a problem to have duplicate
tests than it is to miss one.
-- Brane
--
Branko Čibej
Director of Subversion | WANdisco | www.wandisco.com