Neither the current list nor (probably) the next one is a simple list.

The development version restricts compounding with short words,
but the mechanism is the same.

OpenTaal also has a word list, but its coverage is quite limited.

Unfortunately, I think e-mail communication makes transfer of knowledge 
hard.

I suggest we separate discussions about hunspell expansion and getting a 
compounding mechanism into the LT speller. These are separate issues.
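To make the second issue concrete, here is a minimal sketch of a compound check that combines the ideas raised further down in this thread: a plain word list, a blacklist of forbidden compounds, a cap on the number of parts, and a minimum part length. All names and the toy data are hypothetical; this is not how Hunspell, MorfologikSpeller or JWordSplitter work, and Dutch linking elements such as "s" are ignored.

```python
# Hypothetical toy data; a real speller would load these from dictionary files.
WORDS = {"muskaatnoot"}                      # plain word list
PARTS = {"muskaat", "noot", "nood",          # words allowed as compound parts
         "krokodillen", "tranen", "stroom", "gebied"}
BLACKLIST = {"muskaatnood"}                  # compounds that must never be accepted


def split_compound(word, parts, min_len=5, max_parts=3):
    """Return one split of `word` into known parts, or None if there is none.

    min_len=5 models "only parts with more than 4 letters";
    max_parts=3 models "at most 3 parts".
    """
    def walk(rest, used):
        if not rest:
            return []                        # whole word consumed
        if used == max_parts:
            return None                      # too many parts already
        for end in range(min_len, len(rest) + 1):
            head = rest[:end]
            if head in parts:
                tail = walk(rest[end:], used + 1)
                if tail is not None:
                    return [head] + tail
        return None

    return walk(word, 0)


def accepts(word, words=WORDS, parts=PARTS, blacklist=BLACKLIST,
            min_len=5, max_parts=3):
    """Accept a word if it is listed, or splittable and not blacklisted."""
    if word in blacklist:
        return False
    if word in words:
        return True
    return bool(split_compound(word, parts, min_len, max_parts))
```

With `min_len=4`, `split_compound("muskaatnood", PARTS, min_len=4)` finds the parts `muskaat` and `nood`, so only the blacklist keeps the word out. With the stricter `min_len=5` the word is not splittable at all, while `muskaatnoot` is still accepted because it is in the plain list.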

Ruud


On 07-05-13 16:41, Marcin Miłkowski wrote:
> On 2013-05-07 13:02, R.J. Baars wrote:
>> Marcin, Hunspell is (almost) perfectly sound, assuming there is a
>> reasonable text as input, and all features are used extensively.
>>
>> The issue is that compounding itself is not safe; it is a language issue.
>> With 'expansion', the process runs the other way around: all words, even
>> words never actually used, will be generated, and so enter the list.
>>
>> For Dutch we tried to prevent acceptance of nonsense words by adding lots
>> of 'Forbiddenwords': compounds that would be accepted but are incorrect or
>> are mostly confused with another word.
> This blacklist would be easy to implement for us. Is the list
> represented simply as a word list or as a hunspell flag? (Both seem to
> be easy to support).
>
>> The list currently published for Hunspell is not very complete in that
>> area; OpenTaal is working hard on a much better version. We would at least
>> have to wait for that one.
> OK, so it's a simple list?
>
>> The expansion might even result in an explosion, since there is no actual
>> compounding limitation in Dutch.
>> Krokodillen+tranen+dal+stroom+gebied+s+veroverings+gedrag is valid.
>>
>> Extra assumptions for compound explosion could be limitation to 3 parts
>> and only compounding parts with >4 letters.
> We could use that if you think this is important. But again, remember
> that even a 1GB word list should not be a huge problem if it's so regular.
>
> Marcin
>
>> Ruud
>>
>>> Hi,
>>>
>>> So, let me get things straight: are you saying that hunspell does not
>>> support compounding for Dutch and German in a sensible way? I thought
>>> the problem was just to have hunspell capabilities in LanguageTool
>>> without native hunspell.
>>>
>>> If not, we should simply have a completely different way to support
>>> compounding in the speller. We could use JWordSplitter to split words,
>>> and have a list of blacklisted compounds. Or simply turn off the runon
>>> words feature for German and Dutch (which will be very easy in the next
>>> version of MorfologikSpeller, because the feature is already implemented).
>>>
>>> Right now I don't know what we need:
>>>
>>>     (1) simulation of hunspell capabilities
>>>     (2) compound splitting and some other ways to exclude mistakes
>>>     (3) a way to turn off runon splitting, and simply a word list of
>>>         probable words.
>>>
>>> Regards,
>>> Marcin
>>>
>>> On 2013-05-04 09:47, Ruud Baars wrote:
>>>> Thanks, Jan, for supporting.
>>>>
>>>> LT now appears to have 2 purposes for a word list: POS tagging and spell
>>>> checking.
>>>> Maybe this could be combined into one, just by adding a flag to the
>>>> words, with an error-probability value. Doing this, it would be possible
>>>> to still 'expand' a hunspell dictionary, to create the biggest possible
>>>> word list for POS tagging, but keep the valuable spell checking info,
>>>> with correctness levels like 'known error (100%)', 'probable error',
>>>> 'might be error', 'extra info'.
>>>> The levels less than 100% could be accompanied by rules as well.
>>>>
>>>> Ruud
>>>>
>>>>
>>>>
>>>> On 03-05-13 23:14, Jan Schreiber wrote:
>>>>> The problem with the compounds in Hunspell that Ruud described exists
>>>>> for German as well. Just saying.
>>>>>
>>>>> On 03.05.2013 13:07, Ruud Baars wrote:
>>>>>> Hi.
>>>>>>
>>>>>> Finally I have a full keyboard, to elaborate a bit on the expansion
>>>>>> issue.
>>>>>>
>>>>>> Spell checking is supposed to signal any incorrect word. So most correct
>>>>>> words should be accepted.
>>>>>> There are words in between though. Words that are technically correct,
>>>>>> but in everyday language use most commonly a mistake for a different
>>>>>> word.
>>>>>>
>>>>>> Example for Dutch: si is one of the notes in do-re-mi-fa-sol-la-si-do.
>>>>>> So it is technically correct. But in over 80% of its occurrences in
>>>>>> Dutch sentences, it is a mistake for "is". So it has intentionally been
>>>>>> left out of the correct word list, even though it is correct.
>>>>>>
>>>>>> When compounding is used, some compounding parts will accidentally
>>>>>> combine into a word that is technically correct, but still most of the
>>>>>> time a mistake. Example: a muskaatnoot (nutmeg) is correct, but also
>>>>>> muskaatnood could easily be generated, since nood (emergency) is a
>>>>>> compounder too.
>>>>>>
>>>>>> No matter how carefully compounds have been selected, lots of nonsense
>>>>>> words have been reported as Hunspell suggestions since the Hunspell
>>>>>> dictionary for Dutch introduced compounding.
>>>>>>
>>>>>> Because of that, it is not a good base material for expansion. The one
>>>>>> being prepared now, to be released at the end of this year (hopefully;
>>>>>> it will be 1 year late by then), could be better base material for
>>>>>> expansion.
>>>>>>
>>>>>> Ruud
>>>>> ------------------------------------------------------------------------------
>>>>> Get 100% visibility into Java/.NET code with AppDynamics Lite
>>>>> It's a free troubleshooting tool designed for production
>>>>> Get down to code-level detail for bottlenecks, with <2% overhead.
>>>>> Download for free and get started troubleshooting in minutes.
>>>>> http://p.sf.net/sfu/appdyn_d2d_ap2
>>>>> _______________________________________________
>>>>> Languagetool-devel mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Learn Graph Databases - Download FREE O'Reilly Book
>>> "Graph Databases" is the definitive new guide to graph databases and
>>> their applications. This 200-page book is written by three acclaimed
>>> leaders in the field. The early access version is available now.
>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>
>>