Re: Improving suggestions in speller rules

Marcin Miłkowski Mon, 22 Apr 2013 14:45:39 -0700

On 2013-04-22 01:27, Jaume Ortolà i Font wrote:
> 2013/4/21 Marcin Miłkowski <[email protected]>
>
>     Yes, you are creating a Pattern (implicitly) because you're using
>     replaceAll with a regexp. It would be much faster if you simply used the
>     test
>
>     Character.getType(xn.charAt(i)) == Character.COMBINING_SPACING_MARK
>
>     to discard diacritical characters when computing the distance.
>
>
> OK. I see. After using the Normalizer we just need to do this:
>
> return xn.charAt(0)==yn.charAt(0);


Hm, the normalizer should put the combining marks after the first 
character, so yes, this should be enough. Nice idea!
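
A minimal sketch of both ideas, for illustration only and not the actual
speller code. The class and method names are made up here. One assumption
worth flagging: besides COMBINING_SPACING_MARK, the loop also checks
NON_SPACING_MARK, because common accents such as U+0300 (combining grave)
fall into that category after NFD decomposition.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are not taken from the speller sources.
class DiacriticCompare {

    // True if c is a combining mark as produced by NFD decomposition.
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    // Drops combining marks with a plain loop, so no regexp Pattern is compiled.
    static String stripMarks(String s) {
        String n = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(n.length());
        for (int i = 0; i < n.length(); i++) {
            char c = n.charAt(i);
            if (!isCombiningMark(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // For single characters: NFD puts the base letter first, so comparing
    // position 0 of the normalized forms ignores the diacritics.
    static boolean sameBaseLetter(char x, char y) {
        String xn = Normalizer.normalize(String.valueOf(x), Normalizer.Form.NFD);
        String yn = Normalizer.normalize(String.valueOf(y), Normalizer.Form.NFD);
        return xn.charAt(0) == yn.charAt(0);
    }
}
```

Characters without a canonical decomposition are left unchanged by the
normalizer, so they only match themselves in this sketch.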

>
>     Hm, I'm not sure that we want to have a case-insensitive comparison.
>     This might be language-dependent, but overall this constitutes an edit
>     distance = 1 in many cases.
>
>
>   The case-insensitive comparison seems reasonable to me. A case
> conversion will almost always be a good suggestion, and very probably the
> best suggestion, I think.

But it's still an edit distance. For some languages, there are essential 
differences between uppercase and lowercase words (think of German). 
This should be a parameter in the .info file.
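
To make the point concrete, here is a rough sketch of a plain Levenshtein
distance with a switch for case handling. It is not the speller's actual
search (which walks the automaton), and the flag name ignoreCase is only
a placeholder for whatever the .info parameter would be called.

```java
// Hypothetical illustration; not the speller's real distance computation.
class CaseAwareDistance {

    // Two characters match if they are equal, or equal after lowercasing
    // when case is ignored.
    static boolean sameChar(char a, char b, boolean ignoreCase) {
        if (a == b) {
            return true;
        }
        return ignoreCase && Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    // Levenshtein distance: insertion, deletion and substitution cost 1 each.
    static int distance(String x, String y, boolean ignoreCase) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= x.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length(); j++) {
                int cost = sameChar(x.charAt(i - 1), y.charAt(j - 1), ignoreCase) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length()];
    }
}
```

With this sketch, distance("Pec_A", "Pecra", true) is 1 and
distance("Pec_A", "Pecra", false) is 2, so the flag alone decides whether
such a candidate falls within a maximum distance of 1.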

>
>     Now, for Pec_A the distance to Pecra is 1 ('A' == 'a' if you use
>     case-insensitive comparison, so only '_' != 'r'). So if you have Pec_A
>     in the dictionary, and you don't have Pecra, this is not a bug, this is
>     what happens with case-insensitivity.
>
>
> There is indeed a bug.
>
> In the Catalan dictionary there are these words:
>
> Pebrades Pebrades NPFPG00
> Pec Pec NPCNSP0
> Pecos Pecos NPCSG00
>
> When searching for suggestions for the misspelled word "Pecra"
> (case-insensitive), this is what happens:
>
> Candidate - depth
> Pebrad 5
> Pecrad 2
> Pec_ad 3 !!
> Pec_Ad 4 !! -> which is accepted as a candidate. I don't know where the
> uppercase A comes from.

Is there Pec_ad at all in the dictionary?

>
> The algorithm should stop searching when the separator ('_') appears (or
> before it appears), but it only stops when two errors are found
> (distance=2), that is "_A", which in case-insensitive comparison is
> sometimes only one error.

Do you have '_' as the separator in your dictionary? I can see that the 
separator is defined as '+' for the spelling dictionary. Note that I did 
not test the speller on files with separators at all (just on pure word 
lists).
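
For reference, the behaviour Jaume expects can be sketched as cutting a
candidate at the separator before it is compared. This is only an
illustration: the method name is invented, and the entry layout
"Pec+Pec+NPCNSP0" is an assumed encoding, not something I have checked
against the Catalan dictionary.

```java
// Hypothetical illustration of stopping at the separator.
class SeparatorCut {

    // Returns the part of an entry before the first separator,
    // or the whole entry when no separator is present.
    static String wordPart(String entry, char separator) {
        int idx = entry.indexOf(separator);
        return idx < 0 ? entry : entry.substring(0, idx);
    }

    public static void main(String[] args) {
        // Assumed entry layout: word + lemma + tag, joined by '+'.
        System.out.println(wordPart("Pec+Pec+NPCNSP0", '+'));
        // A pure word-list entry has no separator and is returned as-is.
        System.out.println(wordPart("Pecos", '+'));
    }
}
```

If the separator configured for the speller does not match the one used
in the dictionary, a cut like this never happens and the separator is
treated as an ordinary letter.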

Best,
Marcin

>
> Best,
> Jaume
>
>
>
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
>
>
>
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>


