On 2013-05-07 17:47, Ruud Baars wrote:
> Neither the current, nor the next one (probably) is a simple list.
>
> The development version is limited on compounding with smaller words,
> but the mechanism is the same.
>
> OpenTaal also has a words list, but coverage of that one is quite limited.
>
> Unfortunately, I think e-mail communication makes transfer of knowledge
> hard.

Well, we don't have an IRC channel, and most of us don't have time for 
IRC anyway.

>
> I suggest we separate discussions about hunspell expansion and getting a
> compounding mechanism into the LT speller. These are separate issues.

Well, for me it seems to be the same issue still, as I haven't been 
given any reason to believe that hunspell expansion would not give me a 
compounding mechanism for our speller (beyond the size of the word list).
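
To put a rough number on the size question, here is a minimal Python sketch 
of bounded expansion. It assumes the limits Ruud proposed (at most 3 parts, 
only parts with more than 4 letters). The function names and the toy part 
list are made up for illustration, and the sketch ignores linking elements 
and hunspell flags entirely.

```python
from itertools import product


def expand_compounds(parts, max_parts=3, min_len=5):
    """Yield every concatenation of 2..max_parts eligible parts.

    A part is eligible when it has at least min_len letters
    (min_len=5 corresponds to "more than 4 letters").
    """
    eligible = [p for p in parts if len(p) >= min_len]
    for n in range(2, max_parts + 1):
        for combo in product(eligible, repeat=n):
            yield "".join(combo)


def expansion_size(num_eligible, max_parts=3):
    """Number of words expand_compounds would generate."""
    return sum(num_eligible ** n for n in range(2, max_parts + 1))


# Hypothetical toy input; "dal" is dropped because it has only 3 letters.
parts = ["krokodil", "tranen", "stroom", "dal"]
words = list(expand_compounds(parts))
print(len(words))         # 36, i.e. 3**2 + 3**3
print(expansion_size(3))  # 36
```

With 10,000 eligible parts the same formula gives roughly 10^12 candidate 
words, so the limits decide whether plain expansion is feasible at all.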

I'm not saying we have to emulate hunspell. Hunspell cannot support 
Finnish, for example (there's voikko), and Turkish (zemberek replaces 
it). We could adopt some other formalism to describe and support 
compounding. Our finite-state automata are fine but we need to know how 
to represent compounds in them.
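
As one possible alternative to expansion, compounds could be recognized at 
lookup time by splitting the word into known parts and rejecting anything on 
a blacklist. The sketch below is only an illustration under assumptions: a 
Python set stands in for the automaton, and the function name, parameters 
and sample data are hypothetical, not an existing LanguageTool or hunspell 
API.

```python
def is_compound(word, parts, blacklist, max_parts=3, min_len=4):
    """Return True if word splits into 2..max_parts known parts
    and is not explicitly blacklisted."""
    if word in blacklist:
        return False

    def split(rest, used):
        if not rest:
            # A single part is not a compound; require at least two.
            return used >= 2
        if used == max_parts:
            return False
        # Try every prefix that is long enough and is a known part.
        for end in range(min_len, len(rest) + 1):
            if rest[:end] in parts and split(rest[end:], used + 1):
                return True
        return False

    return split(word, 0)


# Sample data taken from Ruud's muskaatnoot/muskaatnood example.
parts = {"muskaat", "noot", "nood"}
blacklist = {"muskaatnood"}
print(is_compound("muskaatnoot", parts, blacklist))  # True
print(is_compound("muskaatnood", parts, blacklist))  # False (blacklisted)
```

The word list stays small this way, at the cost of needing a blacklist for 
every compound that splits cleanly but is usually a mistake.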

Marcin

>
> Ruud
>
>
> On 07-05-13 16:41, Marcin Miłkowski wrote:
>> On 2013-05-07 13:02, R.J. Baars wrote:
>>> Marcin, Hunspell is (almost) perfectly sound, assuming there is a
>>> reasonable text as input, and all features are used extensively.
>>>
>>> The issue is that compounding itself is not safe; it is a language issue.
>>> With 'expansion', the process runs the other way around: all words, even
>>> words never actually used, will be generated, and so enter the list.
>>>
>>> For Dutch we tried to prevent acceptance of nonsense words by adding lots
>>> of 'Forbiddenwords': compounds that would be accepted but are incorrect
>>> or are mostly confused with another word.
>> This blacklist would be easy to implement for us. Is the list
>> represented simply as a word list or as a hunspell flag? (Both seem to
>> be easy to support).
>>
>>> The list currently published for Hunspell is not very complete in that
>>> area; OpenTaal is working hard on a much better version. We would at least
>>> have to wait for that one.
>> OK, so it's a simple list?
>>
>>> The expansion might even result in an explosion, since there is no actual
>>> compounding limitation in Dutch.
>>> Krokodillen+tranen+dal+stroom+gebied+s+veroverings+gedrag is valid.
>>>
>>> Extra assumptions for compound explosion could be limitation to 3 parts
>>> and only compounding parts with >4 letters.
>> We could use that if you think this is important. But again, remember
>> that even a 1GB word list should not be a huge problem if it's so regular.
>>
>> Marcin
>>
>>> Ruud
>>>
>>>> Hi,
>>>>
>>>> So, let me get things straight: are you saying that hunspell does not
>>>> support compounding for Dutch and German in a sensible way? I thought
>>>> the problem was just to have hunspell capabilities in LanguageTool
>>>> without native hunspell.
>>>>
>>>> If not, we should simply have a completely different way to support
>>>> compounding in the speller. We could use JWordSplitter to split words,
>>>> and have a list of blacklisted compounds. Or simply turn off the runon
>>>> words feature for German and Dutch (which will be very easy in the next
>>>> version of MorfologikSpeller, because the feature is already implemented).
>>>>
>>>> Right now I don't know what we need:
>>>>
>>>>      (1) simulation of hunspell capabilities
>>>>      (2) compound splitting and some other ways to exclude mistakes
>>>>      (3) a way to turn off runon splitting, and simply a word list of
>>>> probable words.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>> On 2013-05-04 09:47, Ruud Baars wrote:
>>>>> Thanks, Jan, for supporting.
>>>>>
>>>>> LT now appears to have 2 purposes for a word list: POS tagging and spell
>>>>> checking.
>>>>> Maybe these could be combined into one, just by adding a flag to the
>>>>> words, with an error-probability value. That way, it would still be
>>>>> possible to 'expand' a hunspell dictionary, to create the biggest possible
>>>>> word list for POS tagging, but keep the valuable spell checking info,
>>>>> with correctness levels like 'known error (100%)', 'probable error',
>>>>> 'might be error', 'extra info'.
>>>>> The levels less than 100% could be accompanied by rules as well.
>>>>>
>>>>> Ruud
>>>>>
>>>>>
>>>>>
>>>>> On 03-05-13 23:14, Jan Schreiber wrote:
>>>>>> The problem with the compounds in Hunspell that Ruud described exists
>>>>>> for German as well. Just saying.
>>>>>>
>>>>>> On 03.05.2013 13:07, Ruud Baars wrote:
>>>>>>> Hi.
>>>>>>>
>>>>>>> Finally I have a full keyboard, to elaborate a bit on the expansion
>>>>>>> issue.
>>>>>>>
>>>>>>> Spell checking is supposed to signal any incorrect word. So most correct
>>>>>>> words should be accepted.
>>>>>>> There are words in between, though: words that are technically correct,
>>>>>>> but in everyday language use are most commonly a mistake for a different
>>>>>>> word.
>>>>>>>
>>>>>>> Example for Dutch: si is one of the notes in do-re-mi-fa-sol-la-si-do.
>>>>>>> So it is technically correct. But in over 80% of its occurrences in Dutch
>>>>>>> sentences it is a mistake for is. So it has intentionally been left out
>>>>>>> of the correct words list, even though it is correct.
>>>>>>>
>>>>>>> When compounding is used, some compounding parts will accidentally
>>>>>>> combine into a word that is technically correct, but still most of the
>>>>>>> time a mistake. Example: muskaatnoot (nutmeg) is correct, but
>>>>>>> muskaatnood could easily be generated too, since nood (emergency) is a
>>>>>>> compounding part as well.
>>>>>>>
>>>>>>> No matter how carefully compounds have been selected, lots of nonsense
>>>>>>> words have been reported as Hunspell suggestions since the Hunspell
>>>>>>> dictionary for Dutch introduced compounding.
>>>>>>>
>>>>>>> Because of that, it is not a good base material for expansion. The one
>>>>>>> being prepared now, to be released at the end of this year (hopefully;
>>>>>>> it will be 1 year late by then), could be better base material for
>>>>>>> expansion.
>>>>>>>
>>>>>>> Ruud
>>>>>> ------------------------------------------------------------------------------
>>>>>> Get 100% visibility into Java/.NET code with AppDynamics Lite
>>>>>> It's a free troubleshooting tool designed for production
>>>>>> Get down to code-level detail for bottlenecks, with <2% overhead.
>>>>>> Download for free and get started troubleshooting in minutes.
>>>>>> http://p.sf.net/sfu/appdyn_d2d_ap2
>>>>>> _______________________________________________
>>>>>> Languagetool-devel mailing list
>>>>>> [email protected]
>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Learn Graph Databases - Download FREE O'Reilly Book
>>>> "Graph Databases" is the definitive new guide to graph databases and
>>>> their applications. This 200-page book is written by three acclaimed
>>>> leaders in the field. The early access version is available now.
>>>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>


