On 2013-05-07 13:02, R.J. Baars wrote:
> Marcin, Hunspell is (almost) perfectly sound, assuming there is a
> reasonable text as input, and all features are used extensively.
>
> The issue is that compounding itself is not safe; it is a language issue.
> With 'expansion', the process runs the other way around: all words, even
> words never actually used, will be generated, and so enter the list.
>
> For Dutch we tried to prevent acceptance of nonsense words by adding lots
> of 'Forbiddenwords': compounds that would be accepted but are incorrect
> or mostly confused with another word.

This blacklist would be easy to implement for us. Is the list 
represented simply as a word list or as a hunspell flag? (Both seem to 
be easy to support).
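
A minimal sketch of the plain word list variant, assuming the blacklist is a text file with one forbidden compound per line. The function names and the `#` comment convention are made up for illustration; they are not an existing LanguageTool, Morfologik or Hunspell API.

```python
def load_blacklist(lines):
    """Build a set of forbidden words from an iterable of text lines.

    Assumed format: one word per line; blank lines and lines starting
    with '#' are ignored.
    """
    blacklist = set()
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            blacklist.add(word.lower())
    return blacklist


def filter_expanded(words, blacklist):
    """Yield only those expanded words that are not blacklisted."""
    for word in words:
        if word.lower() not in blacklist:
            yield word


# 'muskaatnood' is the nonsense compound mentioned in this thread.
blacklist = load_blacklist(["# forbidden compounds", "muskaatnood", ""])
accepted = list(filter_expanded(["muskaatnoot", "muskaatnood"], blacklist))
```

Supporting a hunspell flag instead would only change how the set is filled; the filtering step stays the same.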

> The list currently published for Hunspell is not very complete in that
> area; OpenTaal is working hard on a much better version. We would at least
> have to wait for that one.

OK, so it's a simple list?

>
> The expansion might even result in an explosion, since there is no actual
> compounding limitation in Dutch.
> Krokodillen+tranen+dal+stroom+gebied+s+veroverings+gedrag is valid.
>
> Extra assumptions for compound explosion could be limitation to 3 parts
> and only compounding parts with >4 letters.

We could use that if you think this is important. But again, remember 
that even a 1GB word list should not be a huge problem if it's so regular.
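
To get a feeling for the numbers, here is a rough sketch of a bounded expansion under the two limits suggested above (at most 3 parts, only parts with more than 4 letters). It is a deliberate simplification: it ignores linking elements such as "s" and any rules about which parts may combine, and `expand_compounds` is a hypothetical helper, not something that exists in hunspell or LanguageTool.

```python
from itertools import product


def expand_compounds(parts, max_parts=3, min_len=5):
    """Generate all concatenations of 2..max_parts compounding parts.

    Parts shorter than min_len letters are skipped; min_len=5 corresponds
    to "more than 4 letters". Repeating a part is allowed here.
    """
    usable = [p for p in parts if len(p) >= min_len]
    for n in range(2, max_parts + 1):
        for combo in product(usable, repeat=n):
            yield "".join(combo)


parts = ["krokodillen", "tranen", "dal", "stroom", "gebied"]
compounds = list(expand_compounds(parts))
```

With these five parts, "dal" is dropped as too short, and the remaining four parts give 4^2 + 4^3 = 80 candidates. With k usable parts the count grows as k^2 + k^3, which is why the limits matter for a real dictionary.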

Marcin

>
> Ruud
>
>> Hi,
>>
>> So, let me get things straight: are you saying that hunspell does not
>> support compounding for Dutch and German in a sensible way? I thought
>> the problem was just to have hunspell capabilities in LanguageTool
>> without native hunspell.
>>
>> If not, we should simply have a completely different way to support
>> compounding in the speller. We could use JWordSplitter to split words,
>> and have a list of blacklisted compounds. Or simply turn off the runon
>> words feature for German and Dutch (which will be very easy in the next
>> version of MorfologikSpeller, because the feature is already implemented).
>>
>> Right now I don't know what we need:
>>
>>    (1) simulation of hunspell capabilities
>>    (2) compound splitting and some other ways to exclude mistakes
>>    (3) a way to turn off runon splitting, and simply a word list of
>> probable words.
>>
>> Regards,
>> Marcin
>>
>> On 2013-05-04 09:47, Ruud Baars wrote:
>>> Thanks, Jan, for supporting.
>>>
>>> LT now appears to have two purposes for a word list: POS tagging and
>>> spell checking.
>>> Maybe these could be combined into one, just by adding a flag to the
>>> words, with an error-probability value. Doing this, it would be possible
>>> to still 'expand' a hunspell dictionary, to create the biggest possible
>>> word list for POS tagging, but keep the valuable spell checking info,
>>> with correctness levels like 'known error (100%)', 'probable error',
>>> 'might be error', 'extra info'.
>>> The levels less than 100% could be accompanied by rules as well.
>>>
>>> Ruud
>>>
>>>
>>>
>>> On 03-05-13 23:14, Jan Schreiber wrote:
>>>> The problem with the compounds in Hunspell that Ruud described exists
>>>> for German as well. Just saying.
>>>>
>>>> On 03.05.2013 13:07, Ruud Baars wrote:
>>>>> Hi.
>>>>>
>>>>> Finally I have a full keyboard, to elaborate a bit on the expansion
>>>>> issue.
>>>>>
>>>>> Spell checking is supposed to signal any incorrect word. So most correct
>>>>> words should be accepted.
>>>>> There are words in between though. Words that are technically correct,
>>>>> but in everyday language use are most commonly a mistake for a different
>>>>> word.
>>>>>
>>>>> Example for Dutch: 'si' is one of the notes in do-re-mi-fa-sol-la-si-do.
>>>>> So it is technically correct. But in over 80% of the hits in Dutch
>>>>> sentences it is a mistake for 'is'. So it has intentionally been left
>>>>> out of the correct words list, even though it is correct.
>>>>>
>>>>> When compounding is used, some compounding parts will accidentally
>>>>> combine into a word that is technically correct, but still most of the
>>>>> time a mistake. Example: a muskaatnoot (nutmeg) is correct, but also
>>>>> muskaatnood could easily be generated, since nood (emergency) is a
>>>>> compounder too.
>>>>>
>>>>> No matter how carefully compounds have been selected, lots of nonsense
>>>>> words have been reported as Hunspell suggestions since the Hunspell
>>>>> dictionary for Dutch introduced compounding.
>>>>>
>>>>> Because of that, it is not a good base material for expansion. The one
>>>>> being fabricated now, to be released at the end of this year (hopefully;
>>>>> it is 1 year late then) could be better base material for expansion.
>>>>>
>>>>> Ruud
>>>> ------------------------------------------------------------------------------
>>>> Get 100% visibility into Java/.NET code with AppDynamics Lite
>>>> It's a free troubleshooting tool designed for production
>>>> Get down to code-level detail for bottlenecks, with <2% overhead.
>>>> Download for free and get started troubleshooting in minutes.
>>>> http://p.sf.net/sfu/appdyn_d2d_ap2
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel

