On 2013-04-29 17:56, Jaume Ortolà i Font wrote:
> Hi,
>
> I have made the algorithm more general. Now we really get all the
> possible combinations of replacements, including multiple and different
> substitutions inside a word.
>
> For example, for "tel·lenovela" we get four suggestions: "telenovel·la,
> tel·lenovela, telenovela, tel·lenovel·la". Note that there are
> replacements L->L·L and L·L->L at the same time, which are necessary to
> find the right suggestion, the first one.
>
> The idea is that once we find a possible replacement, we start two new
> searches recursively: one with the replacement and another without it.
> Both new searches are done from an increased index, not from the start
> of the word.
>
> The "candidates" are added to the list only when we reach a terminal
> branch where no replacement is possible.
>
> The key and replacement pairs are now iterated inside
> getAllReplacements(), so we can get different kinds of replacements
> inside the word.
>
> This works fine with all my examples in Catalan.
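The recursive branching described above can be sketched roughly like this. This is a minimal illustration, not the actual LanguageTool code: the class and method names (apart from getAllReplacements) are made up for the example, and the handling of overlapping keys such as L vs. L·L is simplified, so it may generate more candidates than the real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ReplacementCombinations {

    // Hypothetical replacement pairs, e.g. "f" -> "ph"
    private final Map<String, String> replacements;

    public ReplacementCombinations(Map<String, String> replacements) {
        this.replacements = replacements;
    }

    public List<String> getAllReplacements(String word) {
        List<String> results = new ArrayList<>();
        collect(word, 0, results);
        return results;
    }

    private void collect(String word, int fromIndex, List<String> results) {
        for (int i = fromIndex; i < word.length(); i++) {
            boolean found = false;
            for (Map.Entry<String, String> e : replacements.entrySet()) {
                String key = e.getKey();
                if (word.startsWith(key, i)) {
                    found = true;
                    // branch 1: apply the replacement and continue
                    // from an increased index, not from the start
                    String replaced = word.substring(0, i) + e.getValue()
                            + word.substring(i + key.length());
                    collect(replaced, i + e.getValue().length(), results);
                }
            }
            if (found) {
                // branch 2: keep the word unchanged here and move on
                collect(word, i + 1, results);
                return;
            }
        }
        // terminal branch: no further replacement is possible
        if (!results.contains(word)) {
            results.add(word);
        }
    }
}
```

With one pair "f" -> "ph" and the word "fotograf", this yields the four combinations "photograph", "photograf", "fotograph", and "fotograf", i.e. 2^n candidates for n replacement sites, matching the 15-plus-one count from the four-replacement test mentioned below.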
I did some further tests (and wrote a JUnit test with 4 replacements; it works fine, I get 15 combinations plus the unchanged word), and it works perfectly.

I changed your code slightly: isMisspelled may return true because of the flags set in the properties, and hence some replacements might be accepted just because they are all-uppercase or something similar. We don't want such side effects. I introduced two new properties:

fsa.dict.speller.ignore-camel-case
fsa.dict.speller.ignore-all-uppercase

They are self-explanatory. I think this code is quite fine right now.

What we need next is the prefix/suffix rejection for run-on words (I'm still not sure whether this should be an entry in the .info file) and the conversion of the hunspell file.

I read the code at lucene-hunspell (https://code.google.com/p/lucene-hunspell/source/browse/#svn%2Ftrunk%2Fsrc%2Fjava%2Forg%2Fapache%2Flucene%2Fanalysis%2Fhunspell), but it uses a different strategy than I would: it just tries to stem words, so it isn't useful for unmunching the dictionary. I'm not sure whether this code is the one really used with Lucene (could anyone confirm?), as it's quite old now and the version number is only 0.2. Anyway, it parses the affix file, which is good, and we would need to read the whole hunspell file and produce all affixed forms of every word.

I don't really get the CharArrayMap used in the code to represent lookup tables, but I think the modification we need is in HunspellDictionary.java: a new method that simply iterates through all words in the words CharArrayMap and applies all prefixes and suffixes that match. Unfortunately, we would also need to add support for COMPOUNDMIN, COMPOUNDFLAG, etc. This would be very similar to the affix support that is already there.
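The "unmunch" step proposed above, iterating over all stems and applying every matching affix rule, could look roughly like this. This is a standalone sketch, not the lucene-hunspell API: the Suffix record, the flag encoding, and the expand method are all invented for the example, and real hunspell rules also carry match conditions, prefixes, and compound flags that are omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Unmunch {

    // A drastically simplified suffix rule: a one-character flag,
    // the ending to strip from the stem, and the ending to append
    public record Suffix(char flag, String strip, String append) {}

    // Expand every stem with every suffix rule whose flag the stem carries
    public static List<String> expand(Map<String, String> stemsToFlags,
                                      List<Suffix> suffixes) {
        List<String> forms = new ArrayList<>();
        for (Map.Entry<String, String> e : stemsToFlags.entrySet()) {
            String stem = e.getKey();
            String flags = e.getValue();
            forms.add(stem);  // the bare stem is itself a valid form
            for (Suffix s : suffixes) {
                if (flags.indexOf(s.flag()) >= 0 && stem.endsWith(s.strip())) {
                    forms.add(stem.substring(0, stem.length() - s.strip().length())
                            + s.append());
                }
            }
        }
        return forms;
    }
}
```

For instance, the stem "walk" with flags "SD" and the rules (S, strip "", append "s") and (D, strip "", append "ed") expands to "walk", "walks", "walked"; a strip rule like (Y, strip "y", append "ily") turns "happy" into "happily". The real method would iterate the words CharArrayMap instead of a plain Map and would also have to honor compound settings such as COMPOUNDMIN.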
Regards,
Marcin

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
