W dniu 2013-04-18 16:28, Daniel Naber pisze:
> On 18.04.2013, 14:41:21 Jaume Ortolà i Font wrote:
>
> Hi Jaume,
>
>> For achieving this, I think that some changes in a word should be
>> considered as representing a lesser "distance" from the original word
>> than others. In Catalan, for example, these changes could be:
>
> the right approach is to add this into the algorithm that traverses the
> dictionary tree. For German, I needed a solution fast and ended up with a
> hack in GermanSpellerRule. It's easy to understand, but if you could check
> the morfologik algorithm we use and improve that, it would be great. Marcin
> can probably give some directions I think.

Yes, Daniel is right. The dictionary tree is traversed quite 
straightforwardly; what you want is to prefer some paths not only based 
on simple distance, as we do now, but simply based on similarities. For 
that, fsa_spell uses a simple replacement table, which is easy for 
character-for-character substitution but does not work for 
character-multiple characters substitution. We simply need something 
more general, and we can generate candidate words during tree traversal. 
The code you need to change belongs to morfologik-speller, in Speller 
class, in particular findRepl. The idea of findRepl is quite easy: for a 
given depth (=position in the string) we try to find an entry in 
dictionary (arc) that is not too far in terms of the edit distance. If 
we still have edit distance < limit, and we find that the arc is 
terminal (no other characters in the word), then we add a candidate; 
otherwise, we move one character and try to find another candidate. This 
is all recursive, so we end up with multiple candidates.

findRepl simply uses a loop to go through all arcs in the dictionary and 
stops traversing if the edit distance is too far. There are basically 
two ways of changing this behavior:

(a) change the way edit distance is calculated in cuted(), by allowing 
for more flexible replacements, but this is quite tricky;

(b) do not call cuted for single character-multiple character 
substitutions in a given substitution table;

(c) prepare the list of all possible substitutions in the original word 
and use the original findRepl (similar to your idea, basically); but I'm 
not quite sure if this way, we would really find all candidates we want 
to find.

I think we need to focus on (a) or (b); maybe an additional parameter to 
cuted might tell the function to treat multiple character replacement as 
a single character replacement...

Now, cuted() uses a very smart (not mine!) way of precomputing of edit 
distances in a matrix, based on Jan Daciuk's improvements to Oflazer's 
algorithm. But there is a bug in the original algorithm (the matrix H 
that represents a matrix has wrong dimensions) and Jan does not have 
time to remove it. I don't know how to fix it - I made a dirty trick 
(look for "FIXME" in the code). This is probably quite trivial but I 
don't have time to reread Jan's code and Oflazer's paper.

Best,
Marcin

------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel

Reply via email to