Hi all,
2013/4/25 Jaume Ortolà i Font <[email protected]>
> As predicted, the code I wrote for multiple character substitutions had
> several bugs. I solved them (see the attachment), but more problems could
> arise with other languages or other substitutions.
>
>
OK, I can see you changed the H matrix when generating the replacements. I
understand the logic, but the only way to be really sure that we get what
we want is to start with a lot of cases in JUnit tests. I'm not sure if I
will be ready with the property-reading code very soon, but we really
should start from creating test cases. Maybe I'll find some time in the
next couple of days to write it up. Then we'd simply have an interface to
test the replacements.
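Something as simple as a table of cases with plain assertions would do for a start (JUnit cases would have the same shape). To be clear, the generator below is only a stand-in for the real H-matrix replacement code, and the substitution pairs are made-up examples, not the actual tables for any language:

```java
import java.util.*;

// Test-table sketch for multi-character replacements. The generator is a
// stand-in for the real H-matrix code; the pairs are hypothetical examples.
public class ReplacementTest {
    // misspelled sequence -> correction (hypothetical pairs)
    static final Map<String, String> SUBS = new LinkedHashMap<>();
    static {
        SUBS.put("ss", "ç");
        SUBS.put("x", "tx");
    }

    // Every word obtainable by applying one substitution at one position.
    static List<String> candidates(String word) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : SUBS.entrySet()) {
            for (int i = word.indexOf(e.getKey()); i >= 0;
                     i = word.indexOf(e.getKey(), i + 1)) {
                out.add(word.substring(0, i) + e.getValue()
                        + word.substring(i + e.getKey().length()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // the test table: input word -> expected candidate list
        Map<String, List<String>> cases = new LinkedHashMap<>();
        cases.put("cassa", Arrays.asList("caça"));
        cases.put("caxa", Arrays.asList("catxa"));
        cases.put("casa", Collections.<String>emptyList());
        for (Map.Entry<String, List<String>> c : cases.entrySet()) {
            if (!candidates(c.getKey()).equals(c.getValue()))
                throw new AssertionError("failed: " + c.getKey());
        }
        System.out.println("all cases pass");
    }
}
```

Once the property-reading code is in place, only the `candidates` part would be swapped for the real implementation; the case table stays the same.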
> Here I would like to talk about another approach for generating spelling
> suggestions: just checking the words with substitutions directly.
>
That was one of my early ideas as well. I'm not saying this is the best
idea one could have.
> Several steps could be done, but each step is taken only if no suggestions
> have been found in the previous one. These could be the steps:
>
> 1) Make a tree search.
> 2) Prepare words with substitutions. Are they misspelled words?
> 3) Make a new tree search of words with substitutions.
>
> Note that step 2) is very low cost, and step 3) is high cost. Step 2)
> could even be the first step.
>
I don't see why (3) would be high cost. Searching for candidates with
substitutions is limited by the edit distance anyway, and it's roughly as
complex as checking whether the word is in the dictionary. If we have more
candidates, processing time grows linearly (if the search time for a single
word is t, then for n candidates it would be t·n, which is nothing really
dangerous). Of course, if you don't trust my analysis, the easiest way to
check is to run the morfologik speller rule on a large corpus and profile it
(we have a switch on the command line for that).
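For concreteness, the stepwise control flow you describe could look roughly like this. A HashSet stands in for the dictionary automaton, the "tree search" is a trivial same-length stub, and the single substitution pair is a made-up example:

```java
import java.util.*;

// Control-flow sketch of the stepwise approach: each step runs only if the
// previous one produced nothing. DICT stands in for the automaton.
public class SteppedSuggester {
    static final Set<String> DICT =
            new HashSet<>(Arrays.asList("caça", "casa", "caces"));

    // Stand-in for the real tree search: dictionary words at Hamming
    // distance <= 1 among equal-length words (illustration only).
    static List<String> treeSearch(String word) {
        List<String> out = new ArrayList<>();
        for (String d : DICT) {
            if (d.length() != word.length()) continue;
            int diff = 0;
            for (int i = 0; i < d.length(); i++)
                if (d.charAt(i) != word.charAt(i)) diff++;
            if (diff <= 1) out.add(d);
        }
        return out;
    }

    // Step 2 helper: apply one "ss" -> "ç" substitution (example pair).
    static List<String> substituted(String word) {
        List<String> out = new ArrayList<>();
        int i = word.indexOf("ss");
        if (i >= 0)
            out.add(word.substring(0, i) + "ç" + word.substring(i + 2));
        return out;
    }

    static List<String> suggest(String word) {
        List<String> s = treeSearch(word);            // step 1: tree search
        if (!s.isEmpty()) return s;
        List<String> subs = substituted(word);        // step 2: cheap lookup
        List<String> hits = new ArrayList<>();
        for (String c : subs) if (DICT.contains(c)) hits.add(c);
        if (!hits.isEmpty()) return hits;
        List<String> out = new ArrayList<>();         // step 3: costlier re-search
        for (String c : subs) out.addAll(treeSearch(c));
        return out;
    }
}
```

Here `suggest("cassa")` stops at step 2, since the substituted form is already in the dictionary; step 3 only runs when the substituted forms themselves need a tree search.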
>
> Would this approach be more or less efficient? It depends on the kind and
> the number of errors we find in the texts. When the only errors are
> multiple-character substitutions, it will be faster. When there is one
> multiple-character substitution error plus another kind of error, it will
> be slower. So the only way to decide is to try both and see which performs
> better statistically.
>
> Note that using multiple character substitution inside the tree search
> algorithm is not so costly as repeating the tree search, but it is
> something in between.
>
The cost is really low anyway. I wouldn't worry about it.
Speaking of improvements, I think we need to change runonwords somewhat.
For Polish and possibly other languages, there are words which are correct
but usually are not used separately. For example, we usually don't want
prefix words to be proposed: if "postrealism" is not in the dictionary, we
would get "post realism" as a proposal, which may not make much sense
anyway. It's better to suppress such suggestions. We could have an ignore
list of prefix or suffix words to ensure that such candidates are not
proposed. The easiest way to include them would be to use a second field in
the dictionary (the first field should be generated from frequency lists,
which is something I planned to do but had no time for; Google ngrams would
probably be a great resource for this):
post+A+N
"A" would be the frequency class (very frequent), "R" would be Non-runon
word.
After implementing this, and the replacements thingy, we would only need to
have a good way to create automata from hunspell dictionaries. These
dictionaries define finite lists of words anyway (even with two-level
morphology), so it should be fairly easy to process them with a decent
Java reader.
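The first step of such a reader could be as simple as this; note it only extracts the stems from a .dic file, and applying the .aff expansion rules (the actual word-list generation) would be the real work:

```java
import java.util.*;

// Minimal sketch of reading a hunspell .dic file: skip the count line and
// strip the affix flags ("word/ABC") to get the stems. Expanding the
// affix rules from the .aff file is deliberately left out here.
public class DicReader {
    static List<String> stems(List<String> dicLines) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < dicLines.size(); i++) {   // line 0 is the word count
            String line = dicLines.get(i).trim();
            if (line.isEmpty()) continue;
            int slash = line.indexOf('/');
            out.add(slash >= 0 ? line.substring(0, slash) : line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dic = Arrays.asList("2", "post/A", "realism");
        System.out.println(stems(dic)); // [post, realism]
    }
}
```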
Regards,
Marcin
>
> Best regards,
> Jaume
>
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel