Re: [Languagetool] How to enable spellchecking?

Dominique Pellé Tue, 05 Jun 2012 14:00:06 -0700

Marcin Miłkowski <list-addr...@wp.pl> wrote:

> On 2012-06-04 22:51, Dominique Pellé wrote:
> > Hi
> >
> > Another problem with spell checking is the quality
> > of the Esperanto Hunspell dictionary.  It's not good
> > enough. Too many correct words are highlighted
> > because they are missing from the dictionary. That's
> > not LT's fault here.
>
> OK, I disabled HunspellRule for Esperanto.


Good.


> > Given all the unresolved issues at least in all the languages
> > that I maintain (br, fr, eo), can we consider turning
> > Hunspell off by default? I'm concerned that people
> > downloading the nightly build will experience many
> > spurious errors.
>
> Well, I fixed errors for Breton, and there's a fix for French, so maybe
> it's not a big problem?

Thanks.  The Breton spelling checker looks much better now.

French is better. However, for French, it only accepts the ASCII
apostrophe (U+0027) but not the typographic apostrophe (U+2019).
Yet U+2019 is the recommended apostrophe to use, at least in
French.

In other words, a word like "jusqu'à" is OK but
a word like "jusqu’à" is marked as a typo (yet it should
be the preferred spelling).  I see that the Breton Hunspell
accepts both apostrophes, so we should be able to fix
it in French as is done for Breton. I have not had the time
to look at it yet.
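
One possible workaround, sketched here only as an illustration: map U+2019
to U+0027 before the word is handed to the speller. The class and method
names below are hypothetical and are not part of LanguageTool or Hunspell;
how Breton actually handles both apostrophes may well be different.

```java
// Hypothetical helper, not LanguageTool code: replaces the typographic
// apostrophe with the ASCII one so that a dictionary which only knows
// the ASCII form can still match the word.
final class ApostropheNormalizer {
    private ApostropheNormalizer() {}

    static String normalize(String word) {
        // Left side is the typographic apostrophe, right side the ASCII one.
        return word.replace('\u2019', '\'');
    }

    public static void main(String[] args) {
        String typographic = "jusqu\u2019\u00e0";
        System.out.println(normalize(typographic).equals("jusqu'\u00e0"));
    }
}
```

The downside of this approach is that it hides the distinction from the
speller entirely, so it cannot be used to suggest the typographic form.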

> > As an experiment, I also commented out the Hunspell
> > rule in src/java/org/languagetool/language/Breton.java
> > and LT is then more than twice as fast (even compared
> > with using -d HUNSPELL_RULE).
>
> It's because loading a language activates the HunspellRule constructor,
> and the constructor reads files from disk.

I see that Daniel made it initialize lazily, and using -d HUNSPELL_RULE
now has no measurable overhead compared to the old
version prior to the Hunspell checkins.  See my new measurements
in a previous email. Good.
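
For readers who have not seen the change, the general idea of lazy
initialization can be sketched as follows. This is not Daniel's actual
code; LazyDictionary, its loader argument and loadCount() are made up
for the example. The point is only that the constructor does no file
I/O and the dictionary is loaded on the first lookup.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of lazy initialization; names are illustrative
// and do not correspond to LanguageTool classes.
final class LazyDictionary {
    private final Supplier<Set<String>> loader;
    private Set<String> words;   // stays null until the first lookup
    private int loadCount = 0;   // only here to make the behaviour observable

    LazyDictionary(Supplier<Set<String>> loader) {
        this.loader = loader;    // the constructor itself loads nothing
    }

    synchronized boolean contains(String word) {
        if (words == null) {
            words = loader.get(); // the expensive load happens here, once
            loadCount++;
        }
        return words.contains(word);
    }

    synchronized int loadCount() {
        return loadCount;
    }
}
```

With this shape, a rule that is disabled never calls contains(), so the
cost of reading the dictionary is never paid.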

> I'm not a fan of hunspell; I think it has a wrong approach for creating
> suggestions because the computational complexity of its algorithm is
> simply too high. It should use something else, such as composition of a
> Levenshtein distance automaton with a dictionary automaton, and that
> would be really fast (such an approach is used by suggest methods in
> Lucene). Its "user-friendly" representation of affixation could be even
> nicer with twolc/lexc files for creating automata. Anyway, voikko (the
> Finnish speller) might have scripts to convert hunspell files to such
> automata, and we might, in the future, use a better algorithm. If my
> algorithm for morfologik turns out to be implemented correctly, we may
> also use it for some languages -- Polish is *pathetically* slow in
> hunspell. Right now, however, for pragmatic reasons, I would vote for
> hunspell in LanguageTool 1.8.
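
To make the idea above a bit more concrete, here is a rough sketch of
dictionary-driven suggestion. It is not a real Levenshtein automaton and
it is not what Lucene, hunspell or morfologik do: it is a simpler
stand-in that walks a trie of dictionary words while carrying one row of
the edit-distance table, and abandons a branch as soon as every entry in
the row exceeds the limit. TrieSuggester and its methods are hypothetical
names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: trie walk with an edit-distance row, used here as
// a simple stand-in for composing a Levenshtein automaton with a
// dictionary automaton.
final class TrieSuggester {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String word; // non-null when a dictionary word ends at this node
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.word = word;
    }

    // Returns all non-empty dictionary words within maxDist edits of typo.
    List<String> suggest(String typo, int maxDist) {
        List<String> out = new ArrayList<>();
        int[] firstRow = new int[typo.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i; // distance from the empty prefix
        }
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), typo, firstRow, maxDist, out);
        }
        return out;
    }

    private void walk(Node node, char c, String typo, int[] prev,
                      int maxDist, List<String> out) {
        int[] row = new int[prev.length];
        row[0] = prev[0] + 1;
        int min = row[0];
        for (int i = 1; i < row.length; i++) {
            int subst = prev[i - 1] + (typo.charAt(i - 1) == c ? 0 : 1);
            row[i] = Math.min(subst, Math.min(prev[i] + 1, row[i - 1] + 1));
            min = Math.min(min, row[i]);
        }
        if (node.word != null && row[row.length - 1] <= maxDist) {
            out.add(node.word);
        }
        if (min <= maxDist) { // otherwise no descendant can be close enough
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), typo, row, maxDist, out);
            }
        }
    }
}
```

Only the parts of the dictionary that share a close prefix with the typo
are visited, which is where the speed-up over generating and testing
candidate strings would come from.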

OK.  I'll do more tests in the coming days.  I hope I did not
sound too negative with all the reported issues. The speed is
also much better than in the initial checkin.  Somehow changing
the Hunspell lib sped it up.  I'm also glad that almost all issues with
Hunspell are fixed already, except at least for the U+2019 apostrophe
in French mentioned above.

Thanks!
-- Dominique

