On 2013-05-07 13:02, R.J. Baars wrote:
> Marcin, Hunspell is (almost) perfectly sound, assuming there is a
> reasonable text as input, and all features are used extensively.
>
> The issue is that compounding itself is not safe; it is a language issue.
> With 'expansion', the process is the other way around: all words, even
> words never actually used, will be generated, and so enter the list.
>
> For Dutch we tried to prevent acceptance of nonsense words by adding lots
> of 'Forbiddenwords': compounds that are accepted but incorrect or mostly
> confused.
This blacklist would be easy to implement for us. Is the list represented
simply as a word list or as a hunspell flag? (Both seem to be easy to
support.)

> The list currently published for Hunspell is not very complete in that
> area; OpenTaal is working hard on a much better version. We would at least
> have to wait for that one.

OK, so it's a simple list?

>
> The expansion might even result in an explosion, since there is no actual
> compounding limitation in Dutch.
> Krokodillen+tranen+dal+stroom+gebied+s+veroverings+gedrag is valid.
>
> Extra assumptions for compound explosion could be limitation to 3 parts
> and only compounding parts with >4 letters.

We could use that if you think this is important. But again, remember that
even a 1GB word list should not be a huge problem if it's so regular.

Marcin

>
> Ruud
>
>> Hi,
>>
>> So, let me get things straight: are you saying that hunspell does not
>> support compounding for Dutch and German in a sensible way? I thought
>> the problem was just to have hunspell capabilities in LanguageTool
>> without native hunspell.
>>
>> If not, we should simply have a completely different way to support
>> compounding in the speller. We could use JWordSplitter to split words,
>> and have a list of blacklisted compounds. Or simply turn off the runon
>> words feature for German and Dutch (which will be very easy in the next
>> version of MorfologikSpeller, because the feature is already implemented).
>>
>> Right now I don't know what we need:
>>
>> (1) simulation of hunspell capabilities
>> (2) compound splitting and some other ways to exclude mistakes
>> (3) a way to turn off runon splitting, and simply a word list of
>> probable words.
>>
>> Regards,
>> Marcin
>>
>> On 2013-05-04 09:47, Ruud Baars wrote:
>>> Thanks, Jan, for supporting.
>>>
>>> LT now appears to have 2 purposes for a word list: POS tagging and spell
>>> checking.
>>> Maybe this could be combined into one, just by adding a flag to the
>>> words, with an error-probability value. Doing this, it would be possible
>>> to still 'expand' a hunspell dictionary, to create the biggest possible
>>> word list for POS tagging, but keep the valuable spell checking info,
>>> with correctness levels like 'known error (100%)', 'probable error',
>>> 'might be error', 'extra info'.
>>> The levels less than 100% could be accompanied by rules as well.
>>>
>>> Ruud
>>>
>>> On 03-05-13 23:14, Jan Schreiber wrote:
>>>> The problem with the compounds in Hunspell that Ruud described exists
>>>> for German as well. Just saying.
>>>>
>>>> On 03.05.2013 13:07, Ruud Baars wrote:
>>>>> Hi.
>>>>>
>>>>> Finally I have a full keyboard, to elaborate a bit on the expansion
>>>>> issue.
>>>>>
>>>>> Spell checking is supposed to signal any incorrect word. So most correct
>>>>> words should be accepted.
>>>>> There are words in between though: words that are technically correct,
>>>>> but in everyday language use are most commonly a mistake for a different
>>>>> word.
>>>>>
>>>>> Example for Dutch: si is one of the notes in do-re-mi-fa-sol-la-si-do.
>>>>> So it is technically correct. But in over 80% of the hits in Dutch
>>>>> sentences it is a mistake for is. So it has intentionally been left out
>>>>> of the correct word list, even though it is correct.
>>>>>
>>>>> When compounding is used, some compounding parts will accidentally
>>>>> combine into a word that is technically correct, but still most of the
>>>>> time a mistake. Example: muskaatnoot (nutmeg) is correct, but
>>>>> muskaatnood could also easily be generated, since nood (emergency) is a
>>>>> compounder too.
>>>>>
>>>>> No matter how carefully compounds have been selected, lots of nonsense
>>>>> words have been reported as Hunspell suggestions since the Hunspell
>>>>> dictionary for Dutch introduced compounding.
>>>>>
>>>>> Because of that, it is not a good base material for expansion.
>>>>> The one being fabricated now, to be released at the end of this year
>>>>> (hopefully; it will be 1 year late by then), could be better base
>>>>> material for expansion.
>>>>>
>>>>> Ruud
>>>> ------------------------------------------------------------------------------
>>>> Get 100% visibility into Java/.NET code with AppDynamics Lite
>>>> It's a free troubleshooting tool designed for production
>>>> Get down to code-level detail for bottlenecks, with <2% overhead.
>>>> Download for free and get started troubleshooting in minutes.
>>>> http://p.sf.net/sfu/appdyn_d2d_ap2
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and
>> their applications. This 200-page book is written by three acclaimed
>> leaders in the field. The early access version is available now.
>> Download your free book today!
>> http://p.sf.net/sfu/neotech_d2d_may
>> _______________________________________________
>> Languagetool-devel mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
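
The bounded expansion discussed in this thread (at most 3 parts, only parts
with more than 4 letters, plus a blacklist of forbidden compounds) can be
sketched roughly as follows. This is only an illustration under simplifying
assumptions: the function name expand_compounds and its parameters are
hypothetical, parts are joined naively in every order, and none of Hunspell's
compound flags or linking elements (such as the Dutch "s") are modelled. It is
not how Hunspell, LanguageTool, or MorfologikSpeller actually work.

```python
from itertools import product


def expand_compounds(parts, blacklist=frozenset(), max_parts=3, min_part_len=5):
    """Naively expand compound parts into a word set.

    Hypothetical helper for illustration only:
    - parts shorter than min_part_len are ignored (default 5, i.e. ">4 letters"),
    - compounds consist of 2 to max_parts parts,
    - any generated word found in the blacklist is dropped.
    """
    usable = sorted(p for p in set(parts) if len(p) >= min_part_len)
    compounds = set()
    for n in range(2, max_parts + 1):
        # product(..., repeat=n) yields every ordered combination of n parts.
        for combo in product(usable, repeat=n):
            word = "".join(combo)
            if word not in blacklist:
                compounds.add(word)
    return compounds


# Parts taken from the Dutch example in the thread; "dal" is too short
# under the ">4 letters" assumption and is therefore skipped.
words = expand_compounds(["krokodillen", "tranen", "stroom", "gebied", "dal"])
print(len(words))

# The muskaatnoot/muskaatnood case: the length limit is relaxed here because
# "noot" and "nood" have only 4 letters.
demo = expand_compounds(
    ["muskaat", "noot", "nood"],
    blacklist={"muskaatnood"},
    max_parts=2,
    min_part_len=4,
)
print("muskaatnoot" in demo, "muskaatnood" in demo)
```

Note that the ">4 letters" rule, taken literally, would also exclude "noot",
so a valid word like muskaatnoot would have to come from the base word list
rather than from expansion. The number of generated words still grows quickly
with the number of usable parts, which is the explosion mentioned above.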
