Marcin,

Below is a quote from the wiki on uncompounding. As described, the algorithm seems to accept many illegal compounds. Examples: autoonderdeel, Een-Afrika, mannetje-deur.

At a minimum, checks on each split are needed. That seems rather complex.
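To make the concern concrete, here is a minimal Python sketch of the decomposition described in the wiki text quoted below. It is not MorfologikSpeller code. The toy lexicon, the tag sets, and the names splits, part_allowed and analyze are hypothetical, and it assumes that SUFFIX and INFIX restrict position in the same way the wiki says PREFIX does. Hyphenated forms are not handled.

```python
from typing import Dict, Iterator, List, Optional, Set

FORBID, PREFIX, INFIX, SUFFIX = "FORBID", "PREFIX", "INFIX", "SUFFIX"

# Hypothetical toy lexicon: word -> tags (an empty set means an ordinary word).
LEXICON: Dict[str, Set[str]] = {
    "auto": set(),
    "onderdeel": set(),
    "deur": set(),
    "on": {PREFIX},
    "s": {INFIX},
    "autoonderdeel": {FORBID},  # known-bad compound, suppressed explicitly
}


def splits(word: str, lexicon: Dict[str, Set[str]], max_parts: int) -> Iterator[List[str]]:
    """Yield every way to cut `word` into at most `max_parts` lexicon entries."""
    if not word:
        yield []
        return
    if max_parts == 0:
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in splits(word[end:], lexicon, max_parts - 1):
                yield [head] + rest


def part_allowed(tags: Set[str], index: int, count: int) -> bool:
    """Check the positional constraint of one part (assumed semantics)."""
    if FORBID in tags:
        return False
    if PREFIX in tags:
        return count > 1 and index == 0
    if SUFFIX in tags:
        return count > 1 and index == count - 1
    if INFIX in tags:
        return 0 < index < count - 1
    return True


def analyze(word: str, lexicon: Dict[str, Set[str]] = LEXICON,
            max_parts: int = 4) -> Optional[List[str]]:
    """Return the first acceptable decomposition of `word`, or None."""
    if FORBID in lexicon.get(word, set()):
        return None  # the whole form is explicitly forbidden
    for parts in splits(word, lexicon, max_parts):
        if parts and all(part_allowed(lexicon[p], i, len(parts))
                         for i, p in enumerate(parts)):
            return parts
    return None
```

With a lexicon that contains only the plain words "auto" and "onderdeel", analyze("autoonderdeel", ...) returns a split, so the form would be accepted. It is rejected only once a FORBID entry for that exact form exists, which suggests every bad compound has to be listed individually unless there are further checks on the split.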
MorfologikSpeller uses a finite word list encoded as an automaton. To support compounding, we could use the following algorithm:

If the word is not found on the list, try to decompose it into building blocks: prefixes, infixes, suffixes, and other parts. This can be done by looking for word parts in a similar way to replaceRunOnWords, i.e., by moving a space (but with a predefined maximum number of compound parts, probably > 2 but <= 4).

We need to mark up incorrect compounds (words that are commonly written incorrectly) with a FORBID tag after a standard separator.
We need to mark up prefixes (words that cannot occur in any position other than as a prefix) with a PREFIX tag.
We need to mark up suffixes with a SUFFIX tag.
We need to mark up infixes with an INFIX tag.

All non-FORBID words could be used to analyze a compound form. Words with PREFIX, SUFFIX and INFIX tags would not be proposed by replaceRunOnWords. To make them appear in suggestions, another instance of the same word without any tag would be needed (but suppression would still work).

_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel