2013/4/21 Marcin Miłkowski <[email protected]>
> Yes, you are creating a Pattern (implicitly) because you're using
> replaceAll with a regexp. It would be much faster if you simply used the
> test
>
> Character.getType(xn.charAt(i)) == Character.COMBINING_SPACING_MARK
>
> to discard diacritical characters when computing the distance.
>
OK. I see. After using the Normalizer we just need to do this:
return xn.charAt(0)==yn.charAt(0);
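
For illustration, here is a minimal sketch of that comparison, assuming each character is decomposed on its own with java.text.Normalizer in NFD form, so that an accented letter becomes its base letter followed by combining marks. The class and method names (BaseCharCompare, sameBaseChar) are made up for this example and are not taken from the speller code.

```java
import java.text.Normalizer;

class BaseCharCompare {
    // Hypothetical helper: true when both characters share the same base
    // letter once canonical decomposition (NFD) has split off any diacritics.
    static boolean sameBaseChar(char x, char y) {
        String xn = Normalizer.normalize(String.valueOf(x), Normalizer.Form.NFD);
        String yn = Normalizer.normalize(String.valueOf(y), Normalizer.Form.NFD);
        // After NFD the base letter comes first; combining marks follow it.
        return xn.charAt(0) == yn.charAt(0);
    }

    public static void main(String[] args) {
        // U+00E9 is "e" with an acute accent, U+00E7 is "c" with a cedilla.
        System.out.println(sameBaseChar('\u00e9', 'e'));
        System.out.println(sameBaseChar('\u00e7', 'c'));
        System.out.println(sameBaseChar('\u00e9', 'a'));
    }
}
```

This only covers letters that have a canonical decomposition, and it is still case-sensitive ('E' and 'e' are treated as different).
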
> Hm, I'm not sure that we want to have a case-insensitive comparison.
> This might be language-dependent, but overall this constitutes an edit
> distance = 1 in many cases.
>
The case-insensitive comparison seems reasonable to me. A case conversion
will almost always be a good suggestion, and very probably the best
one, I think.
> Now, for Pec_A the distance to Pecra is 1 ('A' == 'a' if you use
> case-insensitive comparison, so only '_' != 'r'). So if you have Pec_A
> in the dictionary, and you don't have Pecra, this is not a bug, this is
> what happens with case-insensitivity.
>
There is indeed a bug.
In the Catalan dictionary there are these words:
Pebrades Pebrades NPFPG00
Pec Pec NPCNSP0
Pecos Pecos NPCSG00
When searching for suggestions for the misspelled word "Pecra"
(case-insensitive), this is what happens:
Candidate - depth
Pebrad 5
Pecrad 2
Pec_ad 3 !!
Pec_Ad 4 !! -> which is accepted as a candidate. I don't know where the
uppercase A comes from.
The algorithm should stop searching when the separator ('_') appears (or
before it appears), but it only stops once two errors have been found
(distance=2). Here those two errors would be "_A", which under
case-insensitive comparison sometimes counts as only one error.
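
To make the counting concrete, here is a small sketch using a plain Levenshtein distance with an optional case-insensitive character comparison. It is not the speller's actual algorithm (which walks the dictionary automaton); the class and method names are invented for this example. It only shows why "Pec_A" ends up within distance 1 of "Pecra" when case is ignored.

```java
class CaseDistanceDemo {
    // Plain Levenshtein distance; when ignoreCase is true, characters that
    // differ only in case are treated as equal (cost 0).
    static int distance(String a, String b, boolean ignoreCase) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                char x = a.charAt(i - 1);
                char y = b.charAt(j - 1);
                boolean same = ignoreCase
                        ? Character.toLowerCase(x) == Character.toLowerCase(y)
                        : x == y;
                int substitution = prev[j - 1] + (same ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("Pecra", "Pec_A", false)); // case-sensitive: 2
        System.out.println(distance("Pecra", "Pec_A", true));  // case-insensitive: 1
    }
}
```

With case ignored, only 'r' vs '_' counts as an error, so a search that stops at distance 2 does not stop at the separator.
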
Best,
Jaume
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel