Re: Improving suggestions in speller rules

Marcin Miłkowski Sun, 21 Apr 2013 10:40:18 -0700

Hi,


W dniu 2013-04-20 21:38, Jaume Ortolà i Font pisze:
> Hi,Â
>
> As Marcin said, single character-multiple characterÂ substitutions are
> tricky in the algorithm. So far I have come up with no solution.Â

Hm, you need something else for this. In Jan's code, we have a very 
simplistic table for multiple characters (basically two) being 
equivalent to a single one. Now, there is a function match_candidate, 
which is used to test whether letters match, and it's defined this way:

/* Name:        match_candidate
  * Class:      spell_fsa
  * Purpose:    Match two last letters of the candidate against the penultimate
  *             letter of the word.
  * Parameters: i               - (i) current index of the word;
  *             j               - (j) current index of the candidate.
  * Result:     TRUE if letters match, FALSE otherwise.
  * Remarks:    first_column is a vector of strings with indices being
  *             characters. The strings are normally empty, but for single
  *             letters specified in the first column in the character class
  *             files, they contain one or more equivalent pairs of characters.
  */
int
spell_fsa::match_candidate(const int i, const int j)
{
   char *c;
   char c0, c1;

   if (i > 0 && (c = first_column[(unsigned char)(word_ff[i-1])]) && j > 0)
     for (c0 = candidate[j-1], c1 = candidate[j]; *c; c+= 2)
       if (c[0] == c0 && c[1] == c1)
        return TRUE;
   return FALSE;
}//spell_fsa::match_candidate

The encoding of the character table is pretty obvious, and the function 
is used


if (match_candidate(word_index, cand_index)) {
       // The last two letters from candidate, and the previous letter
       // from word_ff match
find_repl(depth, next_node, word_index, cand_index + 1);

Also, there's a function match_word that ensures that we have a matching 
word. This is how it's used in find_repl:

if (m_abs(word_length - 1 - depth) <= e_d &&
            next_node.is_final() &&
            (word_length > 2 && match_word(word_length - 2, cand_index) &&
             (dist = ed(word_length - 3 - (word_index - depth), depth - 1,
                        word_length - 3, cand_index -1) + 1) <= e_d)) {
          word_found.list_item = nstrdup(candidate);
          word_found.dist = dist;
          word_found.cost = dist;               // for the moment
          results.insert_sorted(&word_found);
        }

And it's defined this way:

/* Name:        match_word
  * Class:      spell_fsa
  * Purpose:    Match this and the next letter of the word against the last
  *             letter of the candidate.
  * Parameters: i               - (i) current index of the word;
  *             j               - (i) current index of the candidate.
  * Result:     TRUE if letters match, FALSE otherwise.
  * Remarks:    second_column is a vector of strings with indices being
  *             characters. The strings are normally empty, but for single
  *             letters specified in the second column in the character class
  *             files, they contain one or more equivalent pairs of characters.
  */
int
spell_fsa::match_word(const int i, const int j)
{
   char *c;
   char c0, c1;

   if ((c = second_column[(unsigned char)(candidate[j])]))
     for (c0 = word_ff[i], c1 = word_ff[i+1]; *c; c += 2)
       if (c[0] == c0 && c[1] == c1)
        return TRUE;
   return FALSE;
}//spell_fsa::match_word

Maybe this would help. There must be a way to make it more general. 
Basically, the idea is to replace strings before we even go to computing 
edit distances.


>
> For the rest of cases I explained before, I think there is an easy and
> general solution. When comparing characters at a given depth between the
> original word and the candidate in ed(), this comparison can be made
> case-insensitive and "diacritics-insensitive". This way we'll get more
> proper suggestions. And furthermore if the substitutions are only case
> changes or addition/removal of diacritics, then the total distance will
> be zero, and these suggestions will appear in the first positions, as
> desired. Is this valid for all languages?
>
> I attach Speller.java with my implementation (search for "by Jaume
> Ortola"). Probably it can be implemented in someÂ more efficientÂ way.

Yes, you are creating a Pattern (implicitly) because you're using 
replaceAll with a regexp. It would be much faster if you simply used the 
test

Character.GetType(xn.charAt(i))==COMBINING_SPACING_MARK

to discard diacritical characters when computing the distance.

Also, we should use the same function for computing transposition etc. I 
like this idea, it's quite general. It seems it could work.

>
> When using case-insensitive character comparison, there is a bug. But I
> think it already existed previously. Some strings ending in "_A" are
> accepted as candidates. For example, "Pec_A" is a suggestion for
> "Pecra". These suggestions can be easily discarded, but the bug probably
> indicates that something is not done properly in the algorithm. The
> problem disappears when you use case-sensitive character comparison.

Hm, I'm not sure that we want to have a case-insensitive comparison. 
This might be language-dependent, but overall this constitutes an edit 
distance = 1 in many cases.

Now, for Pec_A the distance to Pecra is 1 ('A' == 'a' if you use 
case-insensitive comparison, so only '_' != 'r'). So if you have Pec_A 
in the dictionary, and you don't have Pecra, this is not a bug, this is 
what happens with case-insensitivity.

Best,
Marcin

> Best,
> Jaume OrtolÃ
>
>
> Salutacions,
> Jaume OrtolÃ
> www.riuraueditors.cat <http://www.riuraueditors.cat>
>
>
>
> 2013/4/18 Marcin MiÅ‚kowski <[email protected] <mailto:[email protected]>>
>
>     W dniu 2013-04-18 16:28, Daniel Naber pisze:
>      > On 18.04.2013, 14:41:21 Jaume OrtolÃ  i Font wrote:
>      >
>      > Hi Jaume,
>      >
>      >> For achieving this, I think that some changes in a word should be
>      >> considered as representing a lesser "distance" from the original
>     word
>      >> than others. In Catalan, for example, these changes could be:
>      >
>      > the right approach is to add this into the algorithm that
>     traverses the
>      > dictionary tree. For German, I needed a solution fast and ended
>     up with a
>      > hack in GermanSpellerRule. It's easy to understand, but if you
>     could check
>      > the morfologik algorithm we use and improve that, it would be
>     great. Marcin
>      > can probably give some directions I think.
>
>     Yes, Daniel is right. The dictionary tree is traversed quite
>     straightforwardly; what you want is to prefer some paths not only based
>     on simple distance, as we do now, but simply based on similarities. For
>     that, fsa_spell uses a simple replacement table, which is easy for
>     character-for-character substitution but does not work for
>     character-multiple characters substitution. We simply need something
>     more general, and we can generate candidate words during tree traversal.
>     The code you need to change belongs to morfologik-speller, in Speller
>     class, in particular findRepl. The idea of findRepl is quite easy: for a
>     given depth (=position in the string) we try to find an entry in
>     dictionary (arc) that is not too far in terms of the edit distance. If
>     we still have edit distance < limit, and we find that the arc is
>     terminal (no other characters in the word), then we add a candidate;
>     otherwise, we move one character and try to find another candidate. This
>     is all recursive, so we end up with multiple candidates.
>
>     findRepl simply uses a loop to go through all arcs in the dictionary and
>     stops traversing if the edit distance is too far. There are basically
>     two ways of changing this behavior:
>
>     (a) change the way edit distance is calculated in cuted(), by allowing
>     for more flexible replacements, but this is quite tricky;
>
>     (b) do not call cuted for single character-multiple character
>     substitutions in a given substitution table;
>
>     (c) prepare the list of all possible substitutions in the original word
>     and use the original findRepl (similar to your idea, basically); but I'm
>     not quite sure if this way, we would really find all candidates we want
>     to find.
>
>     I think we need to focus on (a) or (b); maybe an additional parameter to
>     cuted might tell the function to treat multiple character replacement as
>     a single character replacement...
>
>     Now, cuted() uses a very smart (not mine!) way of precomputing of edit
>     distances in a matrix, based on Jan Daciuk's improvements to Oflazer's
>     algorithm. But there is a bug in the original algorithm (the matrix H
>     that represents a matrix has wrong dimensions) and Jan does not have
>     time to remove it. I don't know how to fix it - I made a dirty trick
>     (look for "FIXME" in the code). This is probably quite trivial but I
>     don't have time to reread Jan's code and Oflazer's paper.
>
>     Best,
>     Marcin
>
>     
> ------------------------------------------------------------------------------
>     Precog is a next-generation analytics platform capable of advanced
>     analytics on semi-structured data. The platform includes APIs for
>     building
>     apps and a phenomenal toolset for data science. Developers can use
>     our toolset for easy data analysis & visualization. Get a free account!
>     http://www2.precog.com/precogplatform/slashdotnewsletter
>     _______________________________________________
>     Languagetool-devel mailing list
>     [email protected]
>     <mailto:[email protected]>
>     https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>
>
>
>
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel

Re: Improving suggestions in speller rules

Reply via email to