Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-07-01 Thread Jeff Law via Gcc-patches
On Fri, 2020-06-05 at 18:23 +, Pip Cet via Gcc-patches wrote:
> David Malcolm  writes:
> 
> > On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote:
> > > How's this?
> > 
> > Thanks; looks good to me.  Hopefully this doesn't clash with Tom's
> > patch.
> 
> It doesn't, but I hope I got the commit message right this time.
> 
> (I don't have git access, so if someone could commit this if it's
> approved, that would be great).
Thanks.  Pushed.
jeff



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-06-05 Thread Pip Cet via Gcc-patches
David Malcolm  writes:

> On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote:
>> How's this?
>
> Thanks; looks good to me.  Hopefully this doesn't clash with Tom's
> patch.

It doesn't, but I hope I got the commit message right this time.

(I don't have git access, so if someone could commit this if it's
approved, that would be great).

>From 7e89a6be601b268b134e20c2702bd55945388a22 Mon Sep 17 00:00:00 2001
From: Pip Cet 
Date: Sat, 30 May 2020 13:39:09 +
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test

The variant of editing distance we use doesn't satisfy the triangle
inequality.

gcc/ChangeLog:

	* spellcheck.c (test_data): Add problematic strings.
	(test_metric_conditions): Don't test the triangle inequality
	condition, which our distance function does not satisfy.
---
 gcc/spellcheck.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 9f7351f364f..45c41d7cef9 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -474,13 +474,17 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  (This
+   is because the triangle inequality fails to hold: the distance between
+   "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but
+   the distance between "abc" and "ca" is 3. Algorithms that calculate the
+   true Levenshtein-Damerau metric are much more expensive.)  */
 
 static void
 test_metric_conditions ()
@@ -504,16 +508,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	= get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	{
-	  edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	  edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	  ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	}
 	}
 }
 }
-- 
2.27.0.rc0



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-06-02 Thread David Malcolm via Gcc-patches
On Sat, 2020-05-30 at 18:51 +, Pip Cet wrote:
> On Sat, May 30, 2020 at 5:06 PM David Malcolm 
> wrote:
> > On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote:
> > > I think we should just omit the triangle inequality test from the
> > > self-test, as in the attached patch.
> > 
> > I like the idea,
> 
> Thanks!
> 
> > but can you update the comment so that it explains why
> > it's not a metric? (better to capture that in a comment in the
> > source
> > rather than just in the mailing list archives).
> 
> How's this?

Thanks; looks good to me.  Hopefully this doesn't clash with Tom's
patch.


Dave



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-06-02 Thread David Malcolm via Gcc-patches
On Mon, 2020-06-01 at 14:11 -0600, Tom Tromey wrote:
> > Did the full DejaGnu testsuite get run?  There are a lot of tests
> > in it
> > that make use of this code.
> 
> I did "make check" and only saw some XFAILs.
> 
> Here's v2 of the patch, which I think addresses your comments.  I did
> not add a new test of get_edit_distance, because as I mentioned
> earlier,
> an existing test already does what you asked for.
> 
> Tom
> 
> commit e897a99dada8d3935343ebf7b14ad7ec36515b3d
> Author: Tom Tromey 
> Date:   Fri May 29 10:46:57 2020 -0600
> 
> Prefer simple case changes in spelling suggestions
> 
> I got this error message when editing gcc and recompiling:
> 
> ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error:
> ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you
> mean ‘DWARF_GNAT_ENCODINGS_GDB’?
>  7714 | = debug_info && gnat_encodings ==
> DWARF_GNAT_ENCODINGS_all;
>   |   ^~~
> ~
>   |   DWARF_GNAT_ENCODING
> S_GDB
> 
> This suggestion could be improved -- what happened here is that I
> failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was
> the
> correct spelling.
> 
> This patch changes gcc's spell checker to prefer simple case
> changes
> when possible.
> 
> I tested this using the self-tests.  A new self-test is also
> included.
> 
> gcc/ChangeLog:
> 
> * spellcheck.c (CASE_COST): New define.
> (BASE_COST): New define.
> (get_edit_distance): Recognize case changes.
> (get_edit_distance_cutoff): Update.
> (test_edit_distances): Update.
> (get_old_cutoff): Update.
> (test_find_closest_string): Add case sensitivity test.

Thanks; looks good to me.

Dave
 




Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-06-01 Thread Tom Tromey
> Did the full DejaGnu testsuite get run?  There are a lot of tests in it
> that make use of this code.

I did "make check" and only saw some XFAILs.

Here's v2 of the patch, which I think addresses your comments.  I did
not add a new test of get_edit_distance, because as I mentioned earlier,
an existing test already does what you asked for.

Tom

commit e897a99dada8d3935343ebf7b14ad7ec36515b3d
Author: Tom Tromey 
Date:   Fri May 29 10:46:57 2020 -0600

Prefer simple case changes in spelling suggestions

I got this error message when editing gcc and recompiling:

../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: 
‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean 
‘DWARF_GNAT_ENCODINGS_GDB’?
 7714 | = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all;
  |   ^~~~
  |   DWARF_GNAT_ENCODINGS_GDB

This suggestion could be improved -- what happened here is that I
failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the
correct spelling.

This patch changes gcc's spell checker to prefer simple case changes
when possible.

I tested this using the self-tests.  A new self-test is also included.

gcc/ChangeLog:

* spellcheck.c (CASE_COST): New define.
(BASE_COST): New define.
(get_edit_distance): Recognize case changes.
(get_edit_distance_cutoff): Update.
(test_edit_distances): Update.
(get_old_cutoff): Update.
(test_find_closest_string): Add case sensitivity test.

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..9f7351f364f 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -25,14 +25,22 @@ along with GCC; see the file COPYING3.  If not see
 #include "spellcheck.h"
 #include "selftest.h"
 
+/* Cost of a case transformation.  */
+#define CASE_COST 1
+
+/* Cost of another kind of edit.  */
+#define BASE_COST 2
+
 /* Get the edit distance between the two strings: the minimal
number of edits that are needed to change one string into another,
where edits can be one-character insertions, removals, or substitutions,
or transpositions of two adjacent characters (counting as one "edit").
 
-   This implementation uses the Wagner-Fischer algorithm for the
-   Damerau-Levenshtein distance; specifically, the "optimal string alignment
-   distance" or "restricted edit distance" variant.  */
+   This implementation uses a modified variant of the Wagner-Fischer
+   algorithm for the Damerau-Levenshtein distance; specifically, the
+   "optimal string alignment distance" or "restricted edit distance"
+   variant.  This implementation has been further modified to take
+   case into account.  */
 
 edit_distance_t
 get_edit_distance (const char *s, int len_s,
@@ -47,9 +55,9 @@ get_edit_distance (const char *s, int len_s,
 }
 
   if (len_s == 0)
-return len_t;
+return BASE_COST * len_t;
   if (len_t == 0)
-return len_s;
+return BASE_COST * len_s;
 
   /* We effectively build a matrix where each (i, j) contains the
  distance between the prefix strings s[0:j] and t[0:i].
@@ -67,7 +75,7 @@ get_edit_distance (const char *s, int len_s,
   /* The first row is for the case of an empty target string, which
  we can reach by deleting every character in the source string.  */
   for (int i = 0; i < len_s + 1; i++)
-v_one_ago[i] = i;
+v_one_ago[i] = i * BASE_COST;
 
   /* Build successive rows.  */
   for (int i = 0; i < len_t; i++)
@@ -83,21 +91,28 @@ get_edit_distance (const char *s, int len_s,
   /* The initial column is for the case of an empty source string; we
 can reach prefixes of the target string of length i
 by inserting i characters.  */
-  v_next[0] = i + 1;
+  v_next[0] = (i + 1) * BASE_COST;
 
   /* Build the rest of the row by considering neighbors to
 the north, west and northwest.  */
   for (int j = 0; j < len_s; j++)
{
- edit_distance_t cost = (s[j] == t[i] ? 0 : 1);
- edit_distance_t deletion = v_next[j] + 1;
- edit_distance_t insertion= v_one_ago[j + 1] + 1;
+ edit_distance_t cost;
+
+ if (s[j] == t[i])
+   cost = 0;
+ else if (TOLOWER (s[j]) == TOLOWER (t[i]))
+   cost = CASE_COST;
+ else
+   cost = BASE_COST;
+ edit_distance_t deletion = v_next[j] + BASE_COST;
+ edit_distance_t insertion= v_one_ago[j + 1] + BASE_COST;
  edit_distance_t substitution = v_one_ago[j] + cost;
  edit_distance_t cheapest = MIN (deletion, insertion);
  cheapest = MIN (cheapest, substitution);
  if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i])
{
- edit_distance_t transposition = v_two_ago[j - 1] + 1;
+ edit_distance_t 

Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-06-01 Thread Tom Tromey
> "David" == David Malcolm  writes:

>> I tested this using the self-tests.  A new self-test is also
>> included.

> Did the full DejaGnu testsuite get run?  There are a lot of tests in it
> that make use of this code.

I didn't try it, but I can.

> The patch should probably update the leading comment to
> get_edit_distance.

Will do.

>> test_get_edit_distance_both_ways ("foo", "FOO", 3);
[...]

> If I'm reading things correctly, the patch here updates the existing
> tests to apply the BASE_COST scale factor, but I don't think it adds
> any direct checks of the cost of case-conversion.  It would be good to
> add those.

It isn't obvious but the foo/FOO test did change.

Tom


Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-05-30 Thread Pip Cet via Gcc-patches
On Sat, May 30, 2020 at 5:06 PM David Malcolm  wrote:
> On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote:
> > I think we should just omit the triangle inequality test from the
> > self-test, as in the attached patch.
>
> I like the idea,

Thanks!

> but can you update the comment so that it explains why
> it's not a metric? (better to capture that in a comment in the source
> rather than just in the mailing list archives).

How's this?
From 2a759aba3639dd782c28c727746c0e6913390fa6 Mon Sep 17 00:00:00 2001
From: Pip Cet 
Date: Sat, 30 May 2020 13:39:09 +
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test.

* spellcheck.c (test_data): Add problematic strings.
(test_metric_conditions): Don't test the triangle inequality
condition, which our distance function does not satisfy.
---
 gcc/ChangeLog|  6 ++
 gcc/spellcheck.c | 22 --
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4fc37369d39..e565d8ecadf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-05-30  Pip Cet  
+
+	* spellcheck.c (test_data): Add problematic strings.
+	(test_metric_conditions): Don't test the triangle inequality
+	condition, which our distance function does not satisfy.
+
 2020-05-28  Nicolas Bértolo  
 
 	* Makefile.in: don't look for libiberty in the "pic" subdirectory
diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..ab17657e8dc 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -438,13 +438,17 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  (This
+   is because the triangle inequality fails to hold: the distance between
+   "ca" and "ac" is 1, and so is the distance between "abc" and "ac", but
+   the distance between "abc" and "ca" is 3. Algorithms that calculate the
+   true Levenshtein-Damerau metric are much more expensive.)  */
 
 static void
 test_metric_conditions ()
@@ -468,16 +472,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	= get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	{
-	  edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	  edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	  ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	}
 	}
 }
 }
-- 
2.27.0.rc0



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-05-30 Thread David Malcolm via Gcc-patches
On Sat, 2020-05-30 at 13:40 +, Pip Cet via Gcc-patches wrote:
> On Fri, May 29, 2020 at 6:21 PM Pip Cet  wrote:
> > IIRC, minimum string alignment does not satisfy the triangle
> > inequality anyway, so test_metric_conditions should probably not
> > pretend to test it...
> 
> I did remember correctly, though of course that should have been
> "optimal string alignment" :-). If you change the spellcheck.c
> test_data array to include "abc", "ac", and "ca", the self-test will
> fail, even without your patch.
> 
> I think we should just omit the triangle inequality test from the
> self-test, as in the attached patch.

I like the idea, but can you update the comment so that it explains why
it's not a metric? (better to capture that in a comment in the source
rather than just in the mailing list archives).

Thanks
Dave



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-05-30 Thread David Malcolm via Gcc-patches
On Fri, 2020-05-29 at 10:54 -0600, Tom Tromey wrote:
> I got this error message when editing gcc and recompiling:
> 
> ../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error:
> ‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you
> mean ‘DWARF_GNAT_ENCODINGS_GDB’?
>  7714 | = debug_info && gnat_encodings ==
> DWARF_GNAT_ENCODINGS_all;
>   |   ^~~
> ~
>   |   DWARF_GNAT_ENCODINGS_GD
> B
> 
> This suggestion could be improved -- what happened here is that I
> failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the
> correct spelling.
> 
> This patch changes gcc's spell checker to prefer simple case changes
> when possible.

Thanks.  I like the overall idea.

> I tested this using the self-tests.  A new self-test is also
> included.

Did the full DejaGnu testsuite get run?  There are a lot of tests in it
that make use of this code.

> gcc/ChangeLog:
> 
>   * spellcheck.c (CASE_COST): New define.
>   (BASE_COST): New define.
>   (get_edit_distance): Recognize case changes.
>   (get_edit_distance_cutoff): Update.
>   (test_edit_distances): Update.
>   (get_old_cutoff): Update.
>   (test_find_closest_string): Add case sensitivity test.
> ---
>  gcc/spellcheck.c | 114 ++---
> --
>  1 file changed, 74 insertions(+), 40 deletions(-)

> diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
> index 7891260a258..9002617453f 100644
> --- a/gcc/spellcheck.c
> +++ b/gcc/spellcheck.c

[...snip...]

The patch should probably update the leading comment to
get_edit_distance.

> @@ -228,47 +241,50 @@ test_get_edit_distance_both_ways (const char
> *a, const char *b,
>  static void
>  test_edit_distances ()
>  {
> -  test_get_edit_distance_both_ways ("", "nonempty", strlen
> ("nonempty"));
> -  test_get_edit_distance_both_ways ("saturday", "sunday", 3);
> -  test_get_edit_distance_both_ways ("foo", "m_foo", 2);
> -  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3);
> +  test_get_edit_distance_both_ways ("", "nonempty",
> + BASE_COST * strlen ("nonempty"));
> +  test_get_edit_distance_both_ways ("saturday", "sunday",
> + BASE_COST * 3);
> +  test_get_edit_distance_both_ways ("foo", "m_foo", BASE_COST * 2);
> +  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 4);
>test_get_edit_distance_both_ways
> -("the quick brown fox jumps over the lazy dog", "dog", 40);
> +("the quick brown fox jumps over the lazy dog", "dog", BASE_COST
> * 40);
>test_get_edit_distance_both_ways
>  ("the quick brown fox jumps over the lazy dog",
>   "the quick brown dog jumps over the lazy fox",
> - 4);
> + BASE_COST * 4);
>test_get_edit_distance_both_ways
>  ("Lorem ipsum dolor sit amet, consectetur adipiscing elit,",
>   "All your base are belong to us",
> - 44);
> + BASE_COST * 44);
>test_get_edit_distance_both_ways ("foo", "FOO", 3);
> -  test_get_edit_distance_both_ways ("fee", "deed", 2);
> -  test_get_edit_distance_both_ways ("coorzd1", "coordx1", 2);
> -  test_get_edit_distance_both_ways ("assert", "sqrt", 3);
> -  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX", 3);
> -  test_get_edit_distance_both_ways ("time", "nice", 2);
> -  test_get_edit_distance_both_ways ("bar", "carg", 2);
> +  test_get_edit_distance_both_ways ("fee", "deed", BASE_COST * 2);
> +  test_get_edit_distance_both_ways ("coorzd1", "coordx1", BASE_COST
> * 2);
> +  test_get_edit_distance_both_ways ("assert", "sqrt", BASE_COST *
> 3);
> +  test_get_edit_distance_both_ways ("PATH_MAX", "INT8_MAX",
> BASE_COST * 3);
> +  test_get_edit_distance_both_ways ("time", "nice", BASE_COST * 2);
> +  test_get_edit_distance_both_ways ("bar", "carg", BASE_COST * 2);
>test_get_edit_distance_both_ways ("gtk_widget_show_all",
> - "GtkWidgetShowAll", 7);
> -  test_get_edit_distance_both_ways ("m_bar", "bar", 2);
> -  test_get_edit_distance_both_ways ("MACRO", "MACRAME", 3);
> -  test_get_edit_distance_both_ways ("ab", "ac", 1);
> -  test_get_edit_distance_both_ways ("ab", "a", 1);
> -  test_get_edit_distance_both_ways ("a", "b", 1);
> -  test_get_edit_distance_both_ways ("nanl", "name", 2);
> -  test_get_edit_distance_both_ways ("char", "bar", 2);
> -  test_get_edit_distance_both_ways ("-optimize", "fsanitize", 5);
> -  test_get_edit_distance_both_ways ("__DATE__", "__i386__", 4);
> + "GtkWidgetShowAll", 10);
> +  test_get_edit_distance_both_ways ("m_bar", "bar", BASE_COST * 2);
> +  test_get_edit_distance_both_ways ("MACRO", "MACRAME", BASE_COST *
> 3);
> +  test_get_edit_distance_both_ways ("ab", "ac", BASE_COST * 1);
> +  test_get_edit_distance_both_ways ("ab", "a", BASE_COST * 1);
> +  test_get_edit_distance_both_ways ("a", "b", BASE_COST * 1);
> +  test_get_edit_distance_both_ways 

Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-05-30 Thread Pip Cet via Gcc-patches
On Fri, May 29, 2020 at 6:21 PM Pip Cet  wrote:
> IIRC, minimum string alignment does not satisfy the triangle
> inequality anyway, so test_metric_conditions should probably not
> pretend to test it...

I did remember correctly, though of course that should have been
"optimal string alignment" :-). If you change the spellcheck.c
test_data array to include "abc", "ac", and "ca", the self-test will
fail, even without your patch.

I think we should just omit the triangle inequality test from the
self-test, as in the attached patch.
From bbb8b5cd7368f471bcbdbe451591e74315a5adcd Mon Sep 17 00:00:00 2001
From: Pip Cet 
Date: Sat, 30 May 2020 13:39:09 +
Subject: [PATCH] Don't test the triangle inequality in the spellcheck.c
 self-test.

* spellcheck.c (test_data): Add problematic strings.
(test_metric_conditions): Don't test the triangle inequality
condition, which our distance function does not satisfy.
---
 gcc/ChangeLog|  6 ++
 gcc/spellcheck.c | 19 +--
 2 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4fc37369d39..e565d8ecadf 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,9 @@
+2020-05-30  Pip Cet  
+
+	* spellcheck.c (test_data): Add problematic strings.
+	(test_metric_conditions): Don't test the triangle inequality
+	condition, which our distance function does not satisfy.
+
 2020-05-28  Nicolas Bértolo  
 
 	* Makefile.in: don't look for libiberty in the "pic" subdirectory
diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..1d2df070445 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -438,13 +438,14 @@ static const char * const test_data[] = {
   "food",
   "boo",
   "1234567890123456789012345678901234567890123456789012345678901234567890"
+  "abc",
+  "ac",
+  "ca",
 };
 
 /* Verify that get_edit_distance appears to be a sane distance function,
-   i.e. the conditions for being a metric.  This is done directly for a
-   small set of examples, using test_data above.  This is O(N^3) in the size
-   of the array, due to the test for the triangle inequality, so we keep the
-   array small.  */
+   even though it doesn't satisfy the conditions for being a metric.  This
+   is done directly for a small set of examples, using test_data above.  */
 
 static void
 test_metric_conditions ()
@@ -468,16 +469,6 @@ test_metric_conditions ()
 	  edit_distance_t dist_ji
 	= get_edit_distance (test_data[j], test_data[i]);
 	  ASSERT_EQ (dist_ij, dist_ji);
-
-	  /* Triangle inequality.  */
-	  for (int k = 0; k < num_test_cases; k++)
-	{
-	  edit_distance_t dist_ik
-		= get_edit_distance (test_data[i], test_data[k]);
-	  edit_distance_t dist_jk
-		= get_edit_distance (test_data[j], test_data[k]);
-	  ASSERT_TRUE (dist_ik <= dist_ij + dist_jk);
-	}
 	}
 }
 }
-- 
2.27.0.rc0



Re: [PATCH] Prefer simple case changes in spelling suggestions

2020-05-29 Thread Pip Cet via Gcc-patches
On Fri, May 29, 2020 at 6:02 PM Tom Tromey  wrote:
> This patch changes gcc's spell checker to prefer simple case changes
> when possible.
>
> I tested this using the self-tests.  A new self-test is also included.

I think that's great, but it should be mentioned in the comment that
the distance function used is not the Damerau-Levenshtein distance,
and not a metric.

(No triangle inequality. For example, the distance between "aB" and
"ba" is 4, but the distance between "aB" and "ab" is 1 and that
between "ab" and "ba" is 2, unless I missed something very clever in
your code.)

IIRC, minimum string alignment does not satisfy the triangle
inequality anyway, so test_metric_conditions should probably not
pretend to test it...


[PATCH] Prefer simple case changes in spelling suggestions

2020-05-29 Thread Tom Tromey
I got this error message when editing gcc and recompiling:

../../gcc/gcc/ada/gcc-interface/decl.c:7714:39: error: 
‘DWARF_GNAT_ENCODINGS_all’ was not declared in this scope; did you mean 
‘DWARF_GNAT_ENCODINGS_GDB’?
 7714 | = debug_info && gnat_encodings == DWARF_GNAT_ENCODINGS_all;
  |   ^~~~
  |   DWARF_GNAT_ENCODINGS_GDB

This suggestion could be improved -- what happened here is that I
failed to upper-case the word, and DWARF_GNAT_ENCODINGS_ALL was the
correct spelling.

This patch changes gcc's spell checker to prefer simple case changes
when possible.

I tested this using the self-tests.  A new self-test is also included.

gcc/ChangeLog:

* spellcheck.c (CASE_COST): New define.
(BASE_COST): New define.
(get_edit_distance): Recognize case changes.
(get_edit_distance_cutoff): Update.
(test_edit_distances): Update.
(get_old_cutoff): Update.
(test_find_closest_string): Add case sensitivity test.
---
 gcc/spellcheck.c | 114 ++-
 1 file changed, 74 insertions(+), 40 deletions(-)

diff --git a/gcc/spellcheck.c b/gcc/spellcheck.c
index 7891260a258..9002617453f 100644
--- a/gcc/spellcheck.c
+++ b/gcc/spellcheck.c
@@ -25,6 +25,12 @@ along with GCC; see the file COPYING3.  If not see
 #include "spellcheck.h"
 #include "selftest.h"
 
+/* Cost of a case transformation.  */
+#define CASE_COST 1
+
+/* Cost of another kind of edit.  */
+#define BASE_COST 2
+
 /* Get the edit distance between the two strings: the minimal
number of edits that are needed to change one string into another,
where edits can be one-character insertions, removals, or substitutions,
@@ -47,9 +53,9 @@ get_edit_distance (const char *s, int len_s,
 }
 
   if (len_s == 0)
-return len_t;
+return BASE_COST * len_t;
   if (len_t == 0)
-return len_s;
+return BASE_COST * len_s;
 
   /* We effectively build a matrix where each (i, j) contains the
  distance between the prefix strings s[0:j] and t[0:i].
@@ -67,7 +73,7 @@ get_edit_distance (const char *s, int len_s,
   /* The first row is for the case of an empty target string, which
  we can reach by deleting every character in the source string.  */
   for (int i = 0; i < len_s + 1; i++)
-v_one_ago[i] = i;
+v_one_ago[i] = i * BASE_COST;
 
   /* Build successive rows.  */
   for (int i = 0; i < len_t; i++)
@@ -83,21 +89,28 @@ get_edit_distance (const char *s, int len_s,
   /* The initial column is for the case of an empty source string; we
 can reach prefixes of the target string of length i
 by inserting i characters.  */
-  v_next[0] = i + 1;
+  v_next[0] = (i + 1) * BASE_COST;
 
   /* Build the rest of the row by considering neighbors to
 the north, west and northwest.  */
   for (int j = 0; j < len_s; j++)
{
- edit_distance_t cost = (s[j] == t[i] ? 0 : 1);
- edit_distance_t deletion = v_next[j] + 1;
- edit_distance_t insertion= v_one_ago[j + 1] + 1;
+ edit_distance_t cost;
+
+ if (s[j] == t[i])
+   cost = 0;
+ else if (TOLOWER (s[j]) == TOLOWER (t[i]))
+   cost = CASE_COST;
+ else
+   cost = BASE_COST;
+ edit_distance_t deletion = v_next[j] + BASE_COST;
+ edit_distance_t insertion= v_one_ago[j + 1] + BASE_COST;
  edit_distance_t substitution = v_one_ago[j] + cost;
  edit_distance_t cheapest = MIN (deletion, insertion);
  cheapest = MIN (cheapest, substitution);
  if (i > 0 && j > 0 && s[j] == t[i - 1] && s[j - 1] == t[i])
{
- edit_distance_t transposition = v_two_ago[j - 1] + 1;
+ edit_distance_t transposition = v_two_ago[j - 1] + BASE_COST;
  cheapest = MIN (cheapest, transposition);
}
  v_next[j + 1] = cheapest;
@@ -185,11 +198,11 @@ get_edit_distance_cutoff (size_t goal_len, size_t 
candidate_len)
   /* If the lengths are close, then round down.  */
   if (max_length - min_length <= 1)
 /* ...but allow an edit distance of at least 1.  */
-return MAX (max_length / 3, 1);
+return BASE_COST * MAX (max_length / 3, 1);
 
   /* Otherwise, round up (thus giving a little extra leeway to some cases
  involving insertions/deletions).  */
-  return (max_length + 2) / 3;
+  return BASE_COST * (max_length + 2) / 3;
 }
 
 #if CHECKING_P
@@ -228,47 +241,50 @@ test_get_edit_distance_both_ways (const char *a, const 
char *b,
 static void
 test_edit_distances ()
 {
-  test_get_edit_distance_both_ways ("", "nonempty", strlen ("nonempty"));
-  test_get_edit_distance_both_ways ("saturday", "sunday", 3);
-  test_get_edit_distance_both_ways ("foo", "m_foo", 2);
-  test_get_edit_distance_both_ways ("hello_world", "HelloWorld", 3);
+  test_get_edit_distance_both_ways ("", "nonempty",
+