Neither the current list nor (probably) the next one is a simple list. The development version is more limited in compounding with short words, but the mechanism is the same.
OpenTaal also has a word list, but its coverage is quite limited.

Unfortunately, I think e-mail communication makes transfer of knowledge hard.

I suggest we separate the discussions about hunspell expansion and about getting a compounding mechanism into the LT speller. These are separate issues.

Ruud

On 07-05-13 16:41, Marcin Miłkowski wrote:
> On 2013-05-07 13:02, R.J. Baars wrote:
>> Marcin, Hunspell is (almost) perfectly sound, assuming there is a
>> reasonable text as input, and all features are used extensively.
>>
>> The issue is that compounding itself is not safe; it is a language issue.
>> Doing 'expansion', the process is the other way around: all words, even
>> words never actually used, will be generated, and so enter the list.
>>
>> For Dutch we tried to prevent acceptance of nonsense words by adding lots
>> of 'Forbiddenwords', compounds accepted but incorrect or mostly confused.
> This blacklist would be easy to implement for us. Is the list
> represented simply as a word list or as a hunspell flag? (Both seem to
> be easy to support).
>
>> The list currently published for Hunspell is not very complete in that
>> area; OpenTaal is working hard on a much better version. We would at least
>> have to wait for that one.
> OK, so it's a simple list?
>
>> The expansion might even result in an explosion, since there is no actual
>> compounding limitation in Dutch.
>> Krokodillen+tranen+dal+stroom+gebied+s+veroverings+gedrag is valid.
>>
>> Extra assumptions for compound explosion could be limitation to 3 parts
>> and only compounding parts with >4 letters.
> We could use that if you think this is important. But again, remember
> that even a 1GB word list should not be a huge problem if it's so regular.
>
> Marcin
>
>> Ruud
>>
>>> Hi,
>>>
>>> So, let me get things straight: are you saying that hunspell does not
>>> support compounding for Dutch and German in a sensible way?
>>> I thought the problem was just to have hunspell capabilities in
>>> LanguageTool without native hunspell.
>>>
>>> If not, we should simply have a completely different way to support
>>> compounding in the speller. We could use JWordSplitter to split words,
>>> and have a list of blacklisted compounds. Or simply turn off the runon
>>> words feature for German and Dutch (which will be very easy in the next
>>> version of MorfologikSpeller, because the feature is already implemented).
>>>
>>> Right now I don't know what we need:
>>>
>>> (1) simulation of hunspell capabilities
>>> (2) compound splitting and some other ways to exclude mistakes
>>> (3) a way to turn off runon splitting, and simply a word list of
>>> probable words.
>>>
>>> Regards,
>>> Marcin
>>>
>>> On 2013-05-04 09:47, Ruud Baars wrote:
>>>> Thanks, Jan, for supporting.
>>>>
>>>> LT now appears to have 2 purposes for a word list: POS tagging and spell
>>>> checking.
>>>> Maybe this could be combined into one, just by adding a flag to the
>>>> words, with an error-probability value. Doing this, it would be possible
>>>> to still 'expand' a hunspell dictionary, to create the biggest possible
>>>> word list for POS tagging, but keep the valuable spell checking info,
>>>> with correctness levels like 'known error (100%)', 'probable error',
>>>> 'might be error', 'extra info'.
>>>> The levels less than 100% could be accompanied by rules as well.
>>>>
>>>> Ruud
>>>>
>>>> On 03-05-13 23:14, Jan Schreiber wrote:
>>>>> The problem with the compounds in Hunspell that Ruud described exists
>>>>> for German as well. Just saying.
>>>>>
>>>>> On 03.05.2013 13:07, Ruud Baars wrote:
>>>>>> Hi.
>>>>>>
>>>>>> Finally I have a full keyboard, to elaborate a bit on the expansion
>>>>>> issue.
>>>>>>
>>>>>> Spell checking is supposed to signal any incorrect word. So most correct
>>>>>> words should be accepted.
>>>>>> There are words in between though.
>>>>>> Words that are technically correct, but in everyday language use
>>>>>> most commonly a mistake for a different word.
>>>>>>
>>>>>> Example for Dutch: si is one of the notes in do-re-mi-fa-sol-la-si-do.
>>>>>> So it is technically correct. But in over 80% of the hits in Dutch
>>>>>> sentences it is a mistake for is. So it has intentionally been left
>>>>>> out of the correct words list, even though it is correct.
>>>>>>
>>>>>> When compounding is used, some compounding parts will accidentally
>>>>>> combine into a word that is technically correct, but still most of the
>>>>>> time a mistake. Example: a muskaatnoot (nutmeg) is correct, but also
>>>>>> muskaatnood could easily be generated, since nood (emergency) is a
>>>>>> compounder too.
>>>>>>
>>>>>> No matter how carefully compounds have been selected, lots of nonsense
>>>>>> words have been reported as Hunspell suggestions since the Hunspell
>>>>>> dictionary for Dutch introduced compounding.
>>>>>>
>>>>>> Because of that, it is not a good base material for expansion. The one
>>>>>> being fabricated now, to be released the end of this year (hopefully,
>>>>>> it is 1 year late then) could be better base material for expansion.
>>>>>>
>>>>>> Ruud
>>>>> _______________________________________________
>>>>> Languagetool-devel mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
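To make the constraints discussed in this thread concrete, here is a minimal sketch of a compound check that combines a list of compounding parts, a blacklist of forbidden compounds, a limit of 3 parts, and a minimum part length of more than 4 letters. It is only an illustration: the function name make_compound_checker and its parameters are hypothetical, it is not how Hunspell, MorfologikSpeller or LanguageTool implement compounding, and it ignores linking elements such as the Dutch "s".

```python
from functools import lru_cache


def make_compound_checker(parts, forbidden, max_parts=3, min_part_len=5):
    """Build a checker for simple words and constrained compounds.

    parts        -- known words usable on their own or as compound parts
    forbidden    -- compounds that could be generated but must be rejected
    max_parts    -- maximum number of parts in a compound (3 in the thread)
    min_part_len -- minimum length of a compound part (>4 letters, so 5)
    All names and defaults are assumptions made for this sketch.
    """
    parts = frozenset(p.lower() for p in parts)
    forbidden = frozenset(w.lower() for w in forbidden)

    def accepts(word):
        word = word.lower()
        if word in forbidden:
            return False  # the blacklist wins over everything else
        if word in parts:
            return True   # a known simple word, whatever its length

        @lru_cache(maxsize=None)
        def can_split(start, remaining):
            # True if word[start:] splits into at most `remaining` parts.
            if start == len(word):
                return True
            if remaining == 0:
                return False
            for end in range(start + min_part_len, len(word) + 1):
                if word[start:end] in parts and can_split(end, remaining - 1):
                    return True
            return False

        return can_split(0, max_parts)

    return accepts


# Usage with a few of the Dutch words mentioned in the thread.
dutch = make_compound_checker(
    parts={"krokodillen", "tranen", "stroom", "gebied", "dal"},
    forbidden=set(),
)
print(dutch("krokodillentranen"))              # two parts of 5+ letters
print(dutch("krokodillentranenstroomgebied"))  # four parts, over the limit
```

Note that with a minimum part length of 5, short parts such as "noot" and "nood" never take part in compounding, so "muskaatnoot" would have to be listed as a whole word, and "muskaatnood" would not be generated at all. With a lower minimum, the blacklist is what keeps "muskaatnood" out.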
