The attachment.

Jaume

2013/4/20 Jaume Ortolà i Font <[email protected]>

> Hi,
>
> As Marcin said, single character-multiple character substitutions are
> tricky in the algorithm. So far I have come up with no solution.
>
> For the rest of cases I explained before, I think there is an easy and
> general solution. When comparing characters at a given depth between the
> original word and the candidate in ed(), this comparison can be made
> case-insensitive and "diacritics-insensitive". This way we'll get more
> proper suggestions. And furthermore if the substitutions are only case
> changes or addition/removal of diacritics, then the total distance will be
> zero, and these suggestions will appear in the first positions, as desired.
> Is this valid for all languages?
>
> I attach Speller.java with my implementation (search for "by Jaume
> Ortola"). Probably it can be implemented in some more efficient way.
>
> When using case-insensitive character comparison, there is a bug. But I
> think it already existed previously. Some strings ending in "_A" are
> accepted as candidates. For example, "Pec_A" is a suggestion for "Pecra".
> These suggestions can be easily discarded, but the bug probably indicates
> that something is not done properly in the algorithm. The problem
> disappears when you use case-sensitive character comparison.
>
> Best,
> Jaume Ortolà
>
>
> Salutacions,
> Jaume Ortolà
> www.riuraueditors.cat
>
>
>
> 2013/4/18 Marcin Miłkowski <[email protected]>
>
>> W dniu 2013-04-18 16:28, Daniel Naber pisze:
>> > On 18.04.2013, 14:41:21 Jaume Ortolà i Font wrote:
>> >
>> > Hi Jaume,
>> >
>> >> For achieving this, I think that some changes in a word should be
>> >> considered as representing a lesser "distance" from the original word
>> >> than others. In Catalan, for example, these changes could be:
>> >
>> > the right approach is to add this into the algorithm that traverses the
>> > dictionary tree. For German, I needed a solution fast and ended up with
>> a
>> > hack in GermanSpellerRule. It's easy to understand, but if you could
>> check
>> > the morfologik algorithm we use and improve that, it would be great.
>> Marcin
>> > can probably give some directions I think.
>>
>> Yes, Daniel is right. The dictionary tree is traversed quite
>> straightforwardly; what you want is to prefer some paths not only based
>> on simple distance, as we do now, but simply based on similarities. For
>> that, fsa_spell uses a simple replacement table, which is easy for
>> character-for-character substitution but does not work for
>> character-multiple characters substitution. We simply need something
>> more general, and we can generate candidate words during tree traversal.
>> The code you need to change belongs to morfologik-speller, in Speller
>> class, in particular findRepl. The idea of findRepl is quite easy: for a
>> given depth (=position in the string) we try to find an entry in
>> dictionary (arc) that is not too far in terms of the edit distance. If
>> we still have edit distance < limit, and we find that the arc is
>> terminal (no other characters in the word), then we add a candidate;
>> otherwise, we move one character and try to find another candidate. This
>> is all recursive, so we end up with multiple candidates.
>>
>> findRepl simply uses a loop to go through all arcs in the dictionary and
>> stops traversing if the edit distance is too far. There are basically
>> two ways of changing this behavior:
>>
>> (a) change the way edit distance is calculated in cuted(), by allowing
>> for more flexible replacements, but this is quite tricky;
>>
>> (b) do not call cuted for single character-multiple character
>> substitutions in a given substitution table;
>>
>> (c) prepare the list of all possible substitutions in the original word
>> and use the original findRepl (similar to your idea, basically); but I'm
>> not quite sure if this way, we would really find all candidates we want
>> to find.
>>
>> I think we need to focus on (a) or (b); maybe an additional parameter to
>> cuted might tell the function to treat multiple character replacement as
>> a single character replacement...
>>
>> Now, cuted() uses a very smart (not mine!) way of precomputing of edit
>> distances in a matrix, based on Jan Daciuk's improvements to Oflazer's
>> algorithm. But there is a bug in the original algorithm (the matrix H
>> that represents a matrix has wrong dimensions) and Jan does not have
>> time to remove it. I don't know how to fix it - I made a dirty trick
>> (look for "FIXME" in the code). This is probably quite trivial but I
>> don't have time to reread Jan's code and Oflazer's paper.
>>
>> Best,
>> Marcin
>>
>>
>> ------------------------------------------------------------------------------
>> Precog is a next-generation analytics platform capable of advanced
>> analytics on semi-structured data. The platform includes APIs for building
>> apps and a phenomenal toolset for data science. Developers can use
>> our toolset for easy data analysis & visualization. Get a free account!
>> http://www2.precog.com/precogplatform/slashdotnewsletter
>> _______________________________________________
>> Languagetool-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>
>
>

Attachment: Speller.java
Description: Binary data

------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel

Reply via email to