Re: improvements in Morfologik speller

Jaume Ortolà i Font Mon, 08 Jun 2015 12:28:25 -0700

2015-06-08 9:39 GMT+02:00 Daniel Naber <daniel.na...@languagetool.org>:

> On 2015-06-02 15:06, Jaume Ortolà i Font wrote:
>
> Hi Jaume,
>
> sorry for the late reply.
>
> > There are some failures with the current German LanguageTool tests.
> > Could you take a look, Daniel? You need to use replacements in
> > lower-case (r rh, rh r). Are the results reasonable?
>
> This case looks like a regression to me:
>
> Not found: 'Haus' in: [Hauch, Hau, Haue, Haut, -Au, -Aue, -Aug, -Haus,
> -Haut, Ahaus, Back, Baku, Bank, Bark, Bau, Bau-, Baud, Baum, Baus,
> Chauke]
>
> As long as there's a suggestion with a distance of 1, shouldn't it be
> preferred over suggestions with a distance of 2?
>
> For the case "Ligafußboll", the suggestion with a distance of 2 seems to
> be lost, I think that shouldn't be the case:
>
> Expected :[Ligafußball, Ligafußballs]
> Actual   :[Ligafußball]
>

You are right. These results are not expected. I will look at them again.

A question: "Ligafußball" doesn't exist as a word in the dictionary. It's a
compound, isn't it?

> > If the preferred option in German is convert-case=false, then my
> > changes will not affect the German tests in any way.
>
> Could you describe what exactly convert-case does, I'm not sure I
> completely understand it.
>

Replacement-pairs, convert-case and ignore-diacritics all work the same way.
If one of these features is enabled, the difference it covers adds a distance
of 0 between the original word and the possible suggestion.

Examples:
If "ss ß" is in replacement-pairs, the distance between Ligafussball
(original wrong word) and Ligafußball (suggestion) is zero.
If convert-case=true, the distance between ligafußball (original word) and
Ligafußball (suggestion) is zero.
If ignore-diacritics=true, the distance between horen (original word) and
hören (suggestion) is zero.
If ignore-diacritics=true, the distance between horem (original word) and
hören (suggestion) is one (not two).
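
To make the examples above concrete, here is a minimal sketch of a Levenshtein
distance in which case and diacritic differences cost nothing. It is not the
Morfologik code: the class name, the constructor flags and the handling of a
single replacement pair (tried by rewriting the whole original word once) are
assumptions made only for illustration.

```java
import java.text.Normalizer;

// Hypothetical illustration, not the actual Morfologik Speller implementation.
final class ZeroCostDistance {
    private final boolean convertCase;       // mirrors convert-case
    private final boolean ignoreDiacritics;  // mirrors ignore-diacritics
    private final String from;               // one replacement pair, e.g. "ss"
    private final String to;                 // ... rewritten to sharp s

    ZeroCostDistance(boolean convertCase, boolean ignoreDiacritics,
                     String from, String to) {
        this.convertCase = convertCase;
        this.ignoreDiacritics = ignoreDiacritics;
        this.from = from;
        this.to = to;
    }

    // Base letter of a character: canonical decomposition, first code unit.
    private static char stripDiacritic(char c) {
        return Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD).charAt(0);
    }

    // Two characters are "the same" if they match after the enabled foldings.
    private boolean sameChar(char x, char y) {
        if (ignoreDiacritics) {
            x = stripDiacritic(x);
            y = stripDiacritic(y);
        }
        if (convertCase) {
            x = Character.toLowerCase(x);
            y = Character.toLowerCase(y);
        }
        return x == y;
    }

    // Standard two-row Levenshtein, with sameChar deciding the substitution cost.
    private int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = sameChar(a.charAt(i - 1), b.charAt(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Distance from the original word to a suggestion; applying the
    // replacement pair to the original word is free (a simplification).
    int distance(String word, String suggestion) {
        int d = levenshtein(word, suggestion);
        if (from != null && word.contains(from)) {
            d = Math.min(d, levenshtein(word.replace(from, to), suggestion));
        }
        return d;
    }
}
```

With all three features on, this sketch gives 0 for Ligafussball/Ligafußball,
ligafußball/Ligafußball and horen/hören, and 1 for horem/hören. With
everything off, horem/hören comes out as 2.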

In the file de_DE.info you wrote:
# ignore-diacritics=false speeds up building the suggestions by a factor of
about 2:

Is that still true with the current Speller code?

A question for Marcin:
As you can see here [1], the isConvertingCase() check is nested inside the
isIgnoringDiacritics() check, so the two options are not independent. Was this
done on purpose? Should we correct it?

Currently, for German, convert-case is true (by default) and
ignore-diacritics is false. "convert-case=true" is necessary so that
capitalized words (for example, at the start of a sentence) are not
marked as errors. But when the Speller looks for suggestions,
convert-case=true is ignored because ignore-diacritics=false.
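
To illustrate the difference, here is a rough sketch of the two layouts. It is
not a copy of the code at [1]: the class, the method names and the boolean
parameters are hypothetical and only stand in for isIgnoringDiacritics() and
isConvertingCase().

```java
import java.text.Normalizer;

// Hypothetical illustration of nested versus independent option checks.
final class CharEquivalence {
    // Base letter of a character: canonical decomposition, first code unit.
    private static char stripDiacritic(char c) {
        return Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD).charAt(0);
    }

    // Nested layout: the case check is only reached when diacritics are ignored.
    static boolean equalNested(char x, char y,
                               boolean ignoreDiacritics, boolean convertCase) {
        if (x == y) {
            return true;
        }
        if (ignoreDiacritics) {
            char xn = stripDiacritic(x);
            char yn = stripDiacritic(y);
            if (xn == yn) {
                return true;
            }
            if (convertCase) {
                return Character.toLowerCase(xn) == Character.toLowerCase(yn);
            }
        }
        return false;
    }

    // Independent layout: each option is applied on its own.
    static boolean equalIndependent(char x, char y,
                                    boolean ignoreDiacritics, boolean convertCase) {
        if (ignoreDiacritics) {
            x = stripDiacritic(x);
            y = stripDiacritic(y);
        }
        if (convertCase) {
            x = Character.toLowerCase(x);
            y = Character.toLowerCase(y);
        }
        return x == y;
    }
}
```

With the German settings (ignoreDiacritics=false, convertCase=true), the nested
version treats 'l' and 'L' as different, while the independent version treats
them as equal.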

Regards,
Jaume

[1]
https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-speller/src/main/java/morfologik/speller/Speller.java#L601

