On 2013-04-22 01:27, Jaume Ortolà i Font wrote:
> 2013/4/21 Marcin Miłkowski <[email protected]>
>
> Yes, you are creating a Pattern (implicitly) because you're using
> replaceAll with a regexp. It would be much faster if you simply used the
> test
>
> Character.getType(xn.charAt(i)) == Character.COMBINING_SPACING_MARK
>
> to discard diacritical characters when computing the distance.
>
>
> OK. I see. After using the Normalizer we just need to do this:
>
> return xn.charAt(0)==yn.charAt(0);
Hm, the normalizer should put the combining marks after the first
character, so yes, this should be enough. Nice idea!
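A minimal sketch of both ideas, assuming xn and yn are single letters decomposed with java.text.Normalizer (form NFD). The class and method names are made up for illustration and are not the speller's code. One caveat: in NFD the combining accents of Latin scripts are in the NON_SPACING_MARK category, so the sketch tests for that category as well as COMBINING_SPACING_MARK.

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
final class FirstLetterCheck {

    /** True if c is a combining mark (non-spacing or spacing). */
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK;
    }

    /** Drops combining marks after NFD decomposition, without any regexp. */
    static String stripMarks(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(nfd.length());
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            if (!isCombiningMark(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Compares two letters by their base character only. */
    static boolean sameBaseLetter(String x, String y) {
        if (x.isEmpty() || y.isEmpty()) {
            return x.isEmpty() && y.isEmpty();
        }
        String xn = Normalizer.normalize(x, Normalizer.Form.NFD);
        String yn = Normalizer.normalize(y, Normalizer.Form.NFD);
        // In NFD the base letter comes first and the marks follow it.
        return xn.charAt(0) == yn.charAt(0);
    }
}
```

With this, sameBaseLetter("\u00e9", "e") is true and stripMarks("Ortol\u00e0") gives "Ortola". Letters that have no canonical decomposition (such as "ł") are left as they are.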
>
> Hm, I'm not sure that we want to have a case-insensitive comparison.
> This might be language-dependent, but overall this constitutes an edit
> distance = 1 in many cases.
>
>
> The case-insensitive comparison seems reasonable to me. A case
> conversion will almost always be a good suggestion, and very probably the
> best suggestion, I think.
But it's still an edit distance. For some languages, there are essential
differences between uppercase and lowercase words (think of German).
This should be a parameter in the .info file.
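Roughly what I have in mind, as a sketch only. The property key "speller.ignore-case" is hypothetical (it is not an existing .info setting), and the distance below is plain Levenshtein, which is an assumption and not the speller's actual search.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
final class CaseAwareDistance {

    // Hypothetical key; not taken from any existing .info file.
    static final String IGNORE_CASE_KEY = "speller.ignore-case";

    /** Reads the flag from .info-style (Java properties) text; defaults to false. */
    static boolean readIgnoreCase(String infoFileContents) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(infoFileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Boolean.parseBoolean(p.getProperty(IGNORE_CASE_KEY, "false"));
    }

    static boolean sameChar(char a, char b, boolean ignoreCase) {
        if (a == b) {
            return true;
        }
        return ignoreCase && Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    /** Plain Levenshtein distance; a case difference costs 1 unless ignoreCase is set. */
    static int distance(String s, String t, boolean ignoreCase) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = sameChar(s.charAt(i - 1), t.charAt(j - 1), ignoreCase) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[t.length()];
    }
}
```

For German, distance("essen", "Essen", false) is 1, so the noun and the verb stay distinct. With the flag set, the distance drops to 0 and the two forms become indistinguishable.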
>
> Now, for Pec_A the distance to Pecra is 1 ('A' == 'a' if you use
> case-insensitive comparison, so only '_' != 'r'). So if you have Pec_A
> in the dictionary, and you don't have Pecra, this is not a bug, this is
> what happens with case-insensitivity.
>
>
> There is indeed a bug.
>
> In the Catalan dictionary there are these words:
>
> Pebrades Pebrades NPFPG00
> Pec Pec NPCNSP0
> Pecos Pecos NPCSG00
>
> When searching for suggestions for the misspelled word "Pecra"
> (case-insensitive), this is what happens:
>
> Candidate - depth
> Pebrad 5
> Pecrad 2
> Pec_ad 3 !!
> Pec_Ad 4 !! -> which is accepted as a candidate. I don't know where the
> uppercase A comes from.
Is there Pec_ad at all in the dictionary?
>
> The algorithm should stop searching when the separator ('_') appears (or
> before it appears), but it only stops when two errors are found
> (distance=2), that is "_A", which in case-insensitive comparison is
> sometimes only one error.
Do you have '_' as the separator in your dictionary? I can see that the
separator is defined as '+' for the spelling dictionary. Note that I did
not test the speller on files with separators at all (just on pure word
lists).
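To make the effect concrete, here is a small sketch, assuming '_' really is the separator and using a plain case-insensitive Levenshtein distance in place of the real search. All names are hypothetical. It only shows why a candidate must be cut at the separator before its distance is measured.

```java
// Hypothetical helper class, for illustration only.
final class SeparatorCut {

    /** Returns the word part of an encoded entry: everything before the separator. */
    static String wordPart(String entry, char separator) {
        int cut = entry.indexOf(separator);
        return cut < 0 ? entry : entry.substring(0, cut);
    }

    /** Plain Levenshtein distance that treats 'A' and 'a' as equal. */
    static int distanceIgnoreCase(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                boolean same = Character.toLowerCase(s.charAt(i - 1))
                        == Character.toLowerCase(t.charAt(j - 1));
                int cost = same ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[t.length()];
    }
}
```

Here distanceIgnoreCase("Pecra", "Pec_A") is 1, so the raw path looks like a valid candidate. After wordPart("Pec_A", '_') the candidate is "Pec", whose distance to "Pecra" is 2, and it would be rejected with a limit of 1.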
Best,
Marcin
>
> Best,
> Jaume
>
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>