Re: [PATCH] Prefer simple case changes in spelling suggestions
On Fri, 2020-06-05 at 18:23 +, Pip Cet via Gcc-patches wrote: > David Malcolm writes: > > > On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote: > > > How's this? > > > > Thanks; looks good to me. Hopefully this doesn't clash with Tom's > > patch. > > It doesn't, but I hope I got the commit message right this time. > > (I don't have git access, so if someone could commit this if it's > approved, that would be great). Thanks. Pushed. jeff
Re: [PATCH] Prefer simple case changes in spelling suggestions
David Malcolm writes: > On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote: >> How's this? > > Thanks; looks good to me. Hopefully this doesn't clash with Tom's > patch. It doesn't, but I hope I got the commit message right this time. (I don't have git access, so if someone could commit this if it's approved, that would be great). >From 7e89a6be601b268b134e20c2702bd55945388a22 Mon Sep 17 00:00:00 2001 From: Pip Cet Date: Sat, 30 May 2020 13:39:09 + Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c self-test The variant of editing distance we use doesn't satisfy the triangle inequality. gcc/ChangeLog: * spellcheck.c (test_data): Add problematic strings. (test_metric_conditions): Don't test the triangle inequality condition, which our distance function does not satisfy. --- gcc/spellcheck.c | 22 -- 1 file changed, 8 insertions(+), 14 deletions(-) diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c index 9f7351f364f..45c41d7cef9 100644 --- a/gcc/spellcheck.c +++ b/gcc/spellcheck.c @@ -474,13 +474,17 @@ static const char * const test_data[] = { "food", "boo", "1234567890123456789012345678901234567890123456789012345678901234567890" + "abc", + "ac", + "ca", }; /* Verify that get_edit_distance appears to be a sane distance function, - i.e. the conditions for being a metric. This is done directly for a - small set of examples, using test_data above. This is O(N^3) in the size - of the array, due to the test for the triangle inequality, so we keep the - array small. */ + even though it doesn't satisfy the conditions for being a metric. (This + is because the triangle inequality fails to hold: the distance between + "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but + the distance between "abc" and "ca" is 3. Algorithms that calculate the + true Levenshtein-Damerau metric are much more expensive.) */ static void test_metric_conditions () @@ -504,16 +508,6 @@ test_metric_conditions () edit_distance_t dist_ji = get_edit_distance (test_data[j], test_data[i]); ASSERT_EQ (dist_ij, dist_ji); - - /* Triangle inequality. */ - for (int k = 0; k < num_test_cases; k++) - { - edit_distance_t dist_ik - = get_edit_distance (test_data[i], test_data[k]); - edit_distance_t dist_jk - = get_edit_distance (test_data[j], test_data[k]); - ASSERT_TRUE (dist_ik <= dist_ij + dist_jk); - } } } } -- 2.27.0.rc0
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote: > On Sat, May 30, 2020 at 5:06 PM David Malcolm > wrote: > > On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote: > > > I think we should just omit the triangle inequality test from the > > > self-test, as in the attached patch. > > > > I like the idea, > > Thanks! > > > but can you update the comment so that it explains why > > it's not a metric? (better to capture that in a comment in the > > source > > rather than just in the mailing list archives). > > How's this? Thanks; looks good to me. Hopefully this doesn't clash with Tom's patch. Dave
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Mon, 2020-06-01 at 14:11 -0600, Tom Tromey wrote: > > Did the full DejaGnu testsuite get run? There are a lot of tests > > in it > > that make use of this code. > > I did "make check" and only saw some XFAILs. > > Here's v2 of the patch, which I think addresses your comments. I did > not add a new test of get_edit_distance, because as I mentioned > earlier, > an existing test already does what you asked for. > > Tom > > commit e897a99dada8d3935343ebf7b14ad7ec36515b3d > Author: Tom Tromey > Date: Fri May 29 10:46:57 2020 -0600 > > Prefer simple case changes in spelling suggestions > > I got this error message when editing gcc and recompiling: > > ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: > ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you > mean ‘DWARF_GNAT_ENCODINGS_GDB’? > 7714 | = debug_info && gnat_encodings == > DWARF_GNAT_ENCODINGS_all; > | ^~~ > ~ > | DWARF_GNAT_ENCODING > S_GDB > > This suggestion could be improved -- what happened here is that I > failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was > the > correct spelling. > > This patch changes gcc's spell checker to prefer simple case > changes > when possible. > > I tested this using the self-tests. A new self-test is also > included. > > gcc/ChangeLog: > > * spellcheck.c (CASE_COST): New define. > (BASE_COST): New define. > (get_edit_distance): Recognize case changes. > (get_edit_distance_cutoff): Update. > (test_edit_distances): Update. > (get_old_cutoff): Update. > (test_find_closest_string): Add case sensitivity test. Thanks; looks good to me. Dave
Re: [PATCH] Prefer simple case changes in spelling suggestions
> Did the full DejaGnu testsuite get run? There are a lot of tests in it > that make use of this code. I did "make check" and only saw some XFAILs. Here's v2 of the patch, which I think addresses your comments. I did not add a new test of get_edit_distance, because as I mentioned earlier, an existing test already does what you asked for. Tom commit e897a99dada8d3935343ebf7b14ad7ec36515b3d Author: Tom Tromey Date: Fri May 29 10:46:57 2020 -0600 Prefer simple case changes in spelling suggestions I got this error message when editing gcc and recompiling: ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean ‘DWARF_GNAT_ENCODINGS_GDB’? 7714 | = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all; | ^~~~ | DWARF_GNAT_ENCODINGS_GDB This suggestion could be improved -- what happened here is that I failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the correct spelling. This patch changes gcc's spell checker to prefer simple case changes when possible. I tested this using the self-tests. A new self-test is also included. gcc/ChangeLog: * spellcheck.c (CASE_COST): New define. (BASE_COST): New define. (get_edit_distance): Recognize case changes. (get_edit_distance_cutoff): Update. (test_edit_distances): Update. (get_old_cutoff): Update. (test_find_closest_string): Add case sensitivity test. diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c index 7891260a258..9f7351f364f 100644 --- a/gcc/spellcheck.c +++ b/gcc/spellcheck.c @@ -25,14 +25,22 @@ along with GCC; see the file COPYING3. If not see #include "spellcheck.h" #include "selftest.h" +/* Cost of a case transformation. */ +#define CASE_COST 1 + +/* Cost of another kind of edit. */ +#define BASE_COST 2 + /* Get the edit distance between the two strings: the minimal number of edits that are needed to change one string into another, where edits can be one-character insertions, removals, or substitutions, or transpositions of two adjacent characters (counting as one "edit"). - This implementation uses the Wagner-Fischer algorithm for the - Damerau-Levenshtein distance; specifically, the "optimal string alignment - distance" or "restricted edit distance" variant. */ + This implementation uses a modified variant of the Wagner-Fischer + algorithm for the Damerau-Levenshtein distance; specifically, the + "optimal string alignment distance" or "restricted edit distance" + variant. This implementation has been further modified to take + case into account. */ edit_distance_t get_edit_distance (const char *s, int len_s, @@ -47,9 +55,9 @@ get_edit_distance (const char *s, int len_s, } if (len_s == 0) -return len_t; +return BASE_COST * len_t; if (len_t == 0) -return len_s; +return BASE_COST * len_s; /* We effectively build a matrix where each (i, j) contains the distance between the prefix strings s[0:j] and t[0:i]. @@ -67,7 +75,7 @@ get_edit_distance (const char *s, int len_s, /* The first row is for the case of an empty target string, which we can reach by deleting every character in the source string. */ for (int i = 0; i < len_s + 1; i++) -v_one_ago[i] = i; +v_one_ago[i] = i * BASE_COST; /* Build successive rows. */ for (int i = 0; i < len_t; i++) @@ -83,21 +91,28 @@ get_edit_distance (const char *s, int len_s, /* The initial column is for the case of an empty source string; we can reach prefixes of the target string of length i by inserting i characters. */ - v_next[0] = i + 1; + v_next[0] = (i + 1) * BASE_COST; /* Build the rest of the row by considering neighbors to the north, west and northwest. */ for (int j = 0; j < len_s; j++) { - edit_distance_t cost = (s[j] == t[i] ? 0 : 1); - edit_distance_t deletion = v_next[j] + 1; - edit_distance_t insertion= v_one_ago[j + 1] + 1; + edit_distance_t cost; + + if (s[j] == t[i]) + cost = 0; + else if (TOLOWER (s[j]) == TOLOWER (t[i])) + cost = CASE_COST; + else + cost = BASE_COST; + edit_distance_t deletion = v_next[j] + BASE_COST; + edit_distance_t insertion= v_one_ago[j + 1] + BASE_COST; edit_distance_t substitution = v_one_ago[j] + cost; edit_distance_t cheapest = MIN (deletion, insertion); cheapest = MIN (cheapest, substitution); if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i]) { - edit_distance_t transposition = v_two_ago[j - 1] + 1; + edit_distance_t
Re: [PATCH] Prefer simple case changes in spelling suggestions
> "David" == David Malcolm writes: >> I tested this using the self-tests. A new self-test is also >> included. > Did the full DejaGnu testsuite get run? There are a lot of tests in it > that make use of this code. I didn't try it, but I can. > The patch should probably update the leading comment to > get_edit_distance. Will do. >> test_get_edit_distance_both_ways ("foo", "FOO", 3); [...] > If I'm reading things correctly, the patch here updates the existing > tests to apply the BASE_COST scale factor, but I don't think it adds > any direct checks of the cost of case-conversion. It would be good to > add those. It isn't obvious but the foo/FOO test did change. Tom
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Sat, May 30, 2020 at 5:06 PM David Malcolm wrote: > On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote: > > I think we should just omit the triangle inequality test from the > > self-test, as in the attached patch. > > I like the idea, Thanks! > but can you update the comment so that it explains why > it's not a metric? (better to capture that in a comment in the source > rather than just in the mailing list archives). How's this? From 2a759aba3639dd782c28c727746c0e6913390fa6 Mon Sep 17 00:00:00 2001 From: Pip Cet Date: Sat, 30 May 2020 13:39:09 + Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c self-test. * spellcheck.c (test_data): Add problematic strings. (test_metric_conditions): Don't test the triangle inequality condition, which our distance function does not satisfy. --- gcc/ChangeLog| 6 ++ gcc/spellcheck.c | 22 -- 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 4fc37369d39..e565d8ecadf 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,9 @@ +2020-05-30 Pip Cet + + * spellcheck.c (test_data): Add problematic strings. + (test_metric_conditions): Don't test the triangle inequality + condition, which our distance function does not satisfy. + 2020-05-28 Nicolas Bértolo * Makefile.in: don't look for libiberty in the "pic" subdirectory diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c index 7891260a258..ab17657e8dc 100644 --- a/gcc/spellcheck.c +++ b/gcc/spellcheck.c @@ -438,13 +438,17 @@ static const char * const test_data[] = { "food", "boo", "1234567890123456789012345678901234567890123456789012345678901234567890" + "abc", + "ac", + "ca", }; /* Verify that get_edit_distance appears to be a sane distance function, - i.e. the conditions for being a metric. This is done directly for a - small set of examples, using test_data above. This is O(N^3) in the size - of the array, due to the test for the triangle inequality, so we keep the - array small. */ + even though it doesn't satisfy the conditions for being a metric. (This + is because the triangle inequality fails to hold: the distance between + "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but + the distance between "abc" and "ca" is 3. Algorithms that calculate the + true Levenshtein-Damerau metric are much more expensive.) */ static void test_metric_conditions () @@ -468,16 +472,6 @@ test_metric_conditions () edit_distance_t dist_ji = get_edit_distance (test_data[j], test_data[i]); ASSERT_EQ (dist_ij, dist_ji); - - /* Triangle inequality. */ - for (int k = 0; k < num_test_cases; k++) - { - edit_distance_t dist_ik - = get_edit_distance (test_data[i], test_data[k]); - edit_distance_t dist_jk - = get_edit_distance (test_data[j], test_data[k]); - ASSERT_TRUE (dist_ik <= dist_ij + dist_jk); - } } } } -- 2.27.0.rc0
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote: > On Fri, May 29, 2020 at 6:21 PM Pip Cet wrote: > > IIRC, minimum string alignment does not satisfy the triangle > > inequality anyway, so test_metric_conditions should probably not > > pretend to test it... > > I did remember correctly, though of course that should have been > "optimal string alignment" :-). If you change the spellcheck.c > test_data array to include "abc", "ac", and "ca", the self-test will > fail, even without your patch. > > I think we should just omit the triangle inequality test from the > self-test, as in the attached patch. I like the idea, but can you update the comment so that it explains why it's not a metric? (better to capture that in a comment in the source rather than just in the mailing list archives). Thanks Dave
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Fri, 2020-05-29 at 10:54 -0600, Tom Tromey wrote: > I got this error message when editing gcc and recompiling: > > ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: > ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you > mean ‘DWARF_GNAT_ENCODINGS_GDB’? > 7714 | = debug_info && gnat_encodings == > DWARF_GNAT_ENCODINGS_all; > | ^~~ > ~ > | DWARF_GNAT_ENCODINGS_GD > B > > This suggestion could be improved -- what happened here is that I > failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the > correct spelling. > > This patch changes gcc's spell checker to prefer simple case changes > when possible. Thanks. I like the overall idea. > I tested this using the self-tests. A new self-test is also > included. Did the full DejaGnu testsuite get run? There are a lot of tests in it that make use of this code. > gcc/ChangeLog: > > * spellcheck.c (CASE_COST): New define. > (BASE_COST): New define. > (get_edit_distance): Recognize case changes. > (get_edit_distance_cutoff): Update. > (test_edit_distances): Update. > (get_old_cutoff): Update. > (test_find_closest_string): Add case sensitivity test. > --- > gcc/spellcheck.c | 114 ++--- > -- > 1 file changed, 74 insertions(+), 40 deletions(-) > diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c > index 7891260a258..9002617453f 100644 > --- a/gcc/spellcheck.c > +++ b/gcc/spellcheck.c [...snip...] The patch should probably update the leading comment to get_edit_distance. > @@ -228,47 +241,50 @@ test_get_edit_distance_both_ways (const char > *a, const char *b, > static void > test_edit_distances () > { > - test_get_edit_distance_both_ways ("", "nonempty", strlen > ("nonempty")); > - test_get_edit_distance_both_ways ("saturday", "sunday", 3); > - test_get_edit_distance_both_ways ("foo", "m_foo", 2); > - test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3); > + test_get_edit_distance_both_ways ("", "nonempty", > + BASE_COST * strlen ("nonempty")); > + test_get_edit_distance_both_ways ("saturday", "sunday", > + BASE_COST * 3); > + test_get_edit_distance_both_ways ("foo", "m_foo", BASE_COST * 2); > + test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 4); >test_get_edit_distance_both_ways > -("the quick brown fox jumps over the lazy dog", "dog", 40); > +("the quick brown fox jumps over the lazy dog", "dog", BASE_COST > * 40); >test_get_edit_distance_both_ways > ("the quick brown fox jumps over the lazy dog", > "the quick brown dog jumps over the lazy fox", > - 4); > + BASE_COST * 4); >test_get_edit_distance_both_ways > ("Lorem ipsum dolor sit amet, consectetur adipiscing elit,", > "All your base are belong to us", > - 44); > + BASE_COST * 44); >test_get_edit_distance_both_ways ("foo", "FOO", 3); > - test_get_edit_distance_both_ways ("fee", "deed", 2); > - test_get_edit_distance_both_ways ("coorzd1", "coordx1", 2); > - test_get_edit_distance_both_ways ("assert", "sqrt", 3); > - test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", 3); > - test_get_edit_distance_both_ways ("time", "nice", 2); > - test_get_edit_distance_both_ways ("bar", "carg", 2); > + test_get_edit_distance_both_ways ("fee", "deed", BASE_COST * 2); > + test_get_edit_distance_both_ways ("coorzd1", "coordx1", BASE_COST > * 2); > + test_get_edit_distance_both_ways ("assert", "sqrt", BASE_COST * > 3); > + test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", > BASE_COST * 3); > + test_get_edit_distance_both_ways ("time", "nice", BASE_COST * 2); > + test_get_edit_distance_both_ways ("bar", "carg", BASE_COST * 2); >test_get_edit_distance_both_ways ("gtk_widget_show_all", > - "GtkWidgetShowAll", 7); > - test_get_edit_distance_both_ways ("m_bar", "bar", 2); > - test_get_edit_distance_both_ways ("MACRO", "MACRAME", 3); > - test_get_edit_distance_both_ways ("ab", "ac", 1); > - test_get_edit_distance_both_ways ("ab", "a", 1); > - test_get_edit_distance_both_ways ("a", "b", 1); > - test_get_edit_distance_both_ways ("nanl", "name", 2); > - test_get_edit_distance_both_ways ("char", "bar", 2); > - test_get_edit_distance_both_ways ("-optimize", "fsanitize", 5); > - test_get_edit_distance_both_ways ("__DATE__", "__i386__", 4); > + "GtkWidgetShowAll", 10); > + test_get_edit_distance_both_ways ("m_bar", "bar", BASE_COST * 2); > + test_get_edit_distance_both_ways ("MACRO", "MACRAME", BASE_COST * > 3); > + test_get_edit_distance_both_ways ("ab", "ac", BASE_COST * 1); > + test_get_edit_distance_both_ways ("ab", "a", BASE_COST * 1); > + test_get_edit_distance_both_ways ("a", "b", BASE_COST * 1); > + test_get_edit_distance_both_ways
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Fri, May 29, 2020 at 6:21 PM Pip Cet wrote: > IIRC, minimum string alignment does not satisfy the triangle > inequality anyway, so test_metric_conditions should probably not > pretend to test it... I did remember correctly, though of course that should have been "optimal string alignment" :-). If you change the spellcheck.c test_data array to include "abc", "ac", and "ca", the self-test will fail, even without your patch. I think we should just omit the triangle inequality test from the self-test, as in the attached patch. From bbb8b5cd7368f471bcbdbe451591e74315a5adcd Mon Sep 17 00:00:00 2001 From: Pip Cet Date: Sat, 30 May 2020 13:39:09 + Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c self-test. * spellcheck.c (test_data): Add problematic strings. (test_metric_conditions): Don't test the triangle inequality condition, which our distance function does not satisfy. --- gcc/ChangeLog| 6 ++ gcc/spellcheck.c | 19 +-- 2 files changed, 11 insertions(+), 14 deletions(-) diff --git a/gcc/ChangeLog b/gcc/ChangeLog index 4fc37369d39..e565d8ecadf 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,9 @@ +2020-05-30 Pip Cet + + * spellcheck.c (test_data): Add problematic strings. + (test_metric_conditions): Don't test the triangle inequality + condition, which our distance function does not satisfy. + 2020-05-28 Nicolas Bértolo * Makefile.in: don't look for libiberty in the "pic" subdirectory diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c index 7891260a258..1d2df070445 100644 --- a/gcc/spellcheck.c +++ b/gcc/spellcheck.c @@ -438,13 +438,14 @@ static const char * const test_data[] = { "food", "boo", "1234567890123456789012345678901234567890123456789012345678901234567890" + "abc", + "ac", + "ca", }; /* Verify that get_edit_distance appears to be a sane distance function, - i.e. the conditions for being a metric. This is done directly for a - small set of examples, using test_data above. This is O(N^3) in the size - of the array, due to the test for the triangle inequality, so we keep the - array small. */ + even though it doesn't satisfy the conditions for being a metric. This + is done directly for a small set of examples, using test_data above. */ static void test_metric_conditions () @@ -468,16 +469,6 @@ test_metric_conditions () edit_distance_t dist_ji = get_edit_distance (test_data[j], test_data[i]); ASSERT_EQ (dist_ij, dist_ji); - - /* Triangle inequality. */ - for (int k = 0; k < num_test_cases; k++) - { - edit_distance_t dist_ik - = get_edit_distance (test_data[i], test_data[k]); - edit_distance_t dist_jk - = get_edit_distance (test_data[j], test_data[k]); - ASSERT_TRUE (dist_ik <= dist_ij + dist_jk); - } } } } -- 2.27.0.rc0
Re: [PATCH] Prefer simple case changes in spelling suggestions
On Fri, May 29, 2020 at 6:02 PM Tom Tromey wrote: > This patch changes gcc's spell checker to prefer simple case changes > when possible. > > I tested this using the self-tests. A new self-test is also included. I think that's great, but it should be mentioned in the comment that the distance function used is not the Damerau-Levenshtein distance, and not a metric. (No triangle inequality. For example, the distance between "aB" and "ba" is 4, but the distance between "aB" and "ab" is 1 and that between "ab" and "ba" is 2, unless I missed something very clever in your code.) IIRC, minimum string alignment does not satisfy the triangle inequality anyway, so test_metric_conditions should probably not pretend to test it...
[PATCH] Prefer simple case changes in spelling suggestions
I got this error message when editing gcc and recompiling: ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean ‘DWARF_GNAT_ENCODINGS_GDB’? 7714 | = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all; | ^~~~ | DWARF_GNAT_ENCODINGS_GDB This suggestion could be improved -- what happened here is that I failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the correct spelling. This patch changes gcc's spell checker to prefer simple case changes when possible. I tested this using the self-tests. A new self-test is also included. gcc/ChangeLog: * spellcheck.c (CASE_COST): New define. (BASE_COST): New define. (get_edit_distance): Recognize case changes. (get_edit_distance_cutoff): Update. (test_edit_distances): Update. (get_old_cutoff): Update. (test_find_closest_string): Add case sensitivity test. --- gcc/spellcheck.c | 114 ++- 1 file changed, 74 insertions(+), 40 deletions(-) diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c index 7891260a258..9002617453f 100644 --- a/gcc/spellcheck.c +++ b/gcc/spellcheck.c @@ -25,6 +25,12 @@ along with GCC; see the file COPYING3. If not see #include "spellcheck.h" #include "selftest.h" +/* Cost of a case transformation. */ +#define CASE_COST 1 + +/* Cost of another kind of edit. */ +#define BASE_COST 2 + /* Get the edit distance between the two strings: the minimal number of edits that are needed to change one string into another, where edits can be one-character insertions, removals, or substitutions, @@ -47,9 +53,9 @@ get_edit_distance (const char *s, int len_s, } if (len_s == 0) -return len_t; +return BASE_COST * len_t; if (len_t == 0) -return len_s; +return BASE_COST * len_s; /* We effectively build a matrix where each (i, j) contains the distance between the prefix strings s[0:j] and t[0:i]. @@ -67,7 +73,7 @@ get_edit_distance (const char *s, int len_s, /* The first row is for the case of an empty target string, which we can reach by deleting every character in the source string. */ for (int i = 0; i < len_s + 1; i++) -v_one_ago[i] = i; +v_one_ago[i] = i * BASE_COST; /* Build successive rows. */ for (int i = 0; i < len_t; i++) @@ -83,21 +89,28 @@ get_edit_distance (const char *s, int len_s, /* The initial column is for the case of an empty source string; we can reach prefixes of the target string of length i by inserting i characters. */ - v_next[0] = i + 1; + v_next[0] = (i + 1) * BASE_COST; /* Build the rest of the row by considering neighbors to the north, west and northwest. */ for (int j = 0; j < len_s; j++) { - edit_distance_t cost = (s[j] == t[i] ? 0 : 1); - edit_distance_t deletion = v_next[j] + 1; - edit_distance_t insertion= v_one_ago[j + 1] + 1; + edit_distance_t cost; + + if (s[j] == t[i]) + cost = 0; + else if (TOLOWER (s[j]) == TOLOWER (t[i])) + cost = CASE_COST; + else + cost = BASE_COST; + edit_distance_t deletion = v_next[j] + BASE_COST; + edit_distance_t insertion= v_one_ago[j + 1] + BASE_COST; edit_distance_t substitution = v_one_ago[j] + cost; edit_distance_t cheapest = MIN (deletion, insertion); cheapest = MIN (cheapest, substitution); if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i]) { - edit_distance_t transposition = v_two_ago[j - 1] + 1; + edit_distance_t transposition = v_two_ago[j - 1] + BASE_COST; cheapest = MIN (cheapest, transposition); } v_next[j + 1] = cheapest; @@ -185,11 +198,11 @@ get_edit_distance_cutoff (size_t goal_len, size_t candidate_len) /* If the lengths are close, then round down. */ if (max_length - min_length <= 1) /* ...but allow an edit distance of at least 1. */ -return MAX (max_length / 3, 1); +return BASE_COST * MAX (max_length / 3, 1); /* Otherwise, round up (thus giving a little extra leeway to some cases involving insertions/deletions). */ - return (max_length + 2) / 3; + return BASE_COST * (max_length + 2) / 3; } #if CHECKING_P @@ -228,47 +241,50 @@ test_get_edit_distance_both_ways (const char *a, const char *b, static void test_edit_distances () { - test_get_edit_distance_both_ways ("", "nonempty", strlen ("nonempty")); - test_get_edit_distance_both_ways ("saturday", "sunday", 3); - test_get_edit_distance_both_ways ("foo", "m_foo", 2); - test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3); + test_get_edit_distance_both_ways ("", "nonempty", +