As far as I know, there is no simple replace rule in Dutch. Looks like a
nice addition though, especially for difficult errors like
'stofgezogen'=>'gestofzuigd'.

I don't know what the impact of changing from Hunspell to the Java
speller would be. I assume the choice is not dynamic, i.e. not detected at
startup (.dic and .aff present => Hunspell; a .dict present => the Java
dictionary is used)?
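
To make the question concrete, here is a minimal sketch of what such start-up detection could look like. It is purely hypothetical: the function name, the file naming scheme and the rule that a .dict file wins over a .dic/.aff pair are assumptions for illustration, not how LanguageTool actually selects its speller.

```python
from pathlib import Path


def detect_speller(resource_dir, lang):
    """Guess which speller backend the files in resource_dir would select.

    Hypothetical rule: a binary .dict file selects the Java speller;
    otherwise a .dic/.aff pair selects Hunspell; otherwise nothing.
    """
    base = Path(resource_dir)
    if (base / f"{lang}.dict").is_file():
        return "morfologik"
    if (base / f"{lang}.dic").is_file() and (base / f"{lang}.aff").is_file():
        return "hunspell"
    return "none"
```

With detection like this, switching spellers would only be a matter of which files are shipped; without it, the choice has to be made in code or configuration.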

In that case, getting ahead is more difficult, since my computer is
already at 100% all the time ;-)

If it is not easy to make it configurable, I would like to try to switch
to the Java dictionary so the testing of the speller runs in the same go
as testing the rules.

Ruud

> On 2014-09-03 14:26, R.J. Baars wrote:
>>
>> Marcin,
>>
>> I filtered the frequencies for any word found more than 50 times; that
>> leaves 800,000 frequencies, about 4 times the size of the internet file.
>> It adds about 0.4 MB to the dictionary, now 9.7 MB in total.
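
The filtering step described above can be sketched as follows. This assumes a plain `word<TAB>count` input, one entry per line; the function name and the handling of malformed lines are illustrative choices, not part of any LanguageTool tooling.

```python
def filter_frequencies(lines, min_count=50):
    """Yield (word, count) for entries seen more than min_count times.

    Each input line is expected to look like 'word<TAB>count'.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue  # skip malformed lines instead of failing
        word, count = parts[0], int(parts[1])
        if count > min_count:
            yield word, count
```

Raising `min_count` trades coverage of rare words for a smaller dictionary file.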
>>
>> The dictionary still needs some improvement (e.g. fully uppercase words
>> longer than 5 characters are in there, which does not conform to the
>> advice of the Dutch Language Union).
>> But that is a concern for later.
>>
>> I added lower- and uppercased words, because I am not sure what
>> algorithms are used for case. If the word found is 'Fuond', and 'found'
>> is in the dictionary, I assume default behaviour is to suggest 'Found'.
>> Accepted forms are 'found', 'Found' and 'FOUND'. (Is that assumption
>> correct?)
>
> Yes.
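
The case behaviour confirmed here can be written down as a small sketch. Both function names are hypothetical and this is not the speller's actual code; it only encodes the rule as stated: a lowercase entry is accepted as typed, capitalised, or fully uppercased, and a suggestion takes over the case pattern of the misspelled word.

```python
def is_accepted(word, dictionary):
    """Accept a word as listed, capitalised, or fully uppercased."""
    if word in dictionary:
        return True
    lower = word.lower()
    if word == lower.capitalize() or word == lower.upper():
        return lower in dictionary
    return False


def match_case(suggestion, typed):
    """Give the suggestion the same case pattern as the typed word."""
    if typed.isupper() and len(typed) > 1:
        return suggestion.upper()
    if typed[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion
```

Under this rule 'TV' is accepted whenever 'tv' is listed, which is why words that must not be fully uppercased need separate handling.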
>
>>
>> I need some words to be accepted only in lowercase, like 'tv', which
>> only has the correct forms 'Tv' and 'tv'; 'TV' is wrong. The same goes
>> for some other words. (In Hunspell I used the KEEPCASE flag on those
>> words.)
>
> Hm, I'm not sure. But you can easily put that in a separate file of
> common simple mistakes (for SimpleReplaceRule). I found maintaining such
> a file easier than trying to use the same dictionary-search method for
> suggestions. It was particularly difficult for two- and three-letter
> words, and with a SimpleReplaceRule it's just a matter of adding the
> word to the file like this:
>
> TV    tv
>
> And appropriate uppercasing will be applied by the rule anyway.
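
A rough sketch of how a file of this shape could be used, assuming one whitespace-separated `wrong right` pair per line as in the example above. The function names, the `#` comment convention and the sentence-start flag are assumptions for illustration; this is not the SimpleReplaceRule implementation.

```python
def load_replacements(text):
    """Parse lines of the form 'wrong<whitespace>right' into a dict."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comments (assumed convention)
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # ignore lines without a replacement
        pairs[parts[0]] = parts[1].strip()
    return pairs


def suggest(token, sentence_start, pairs):
    """Return the replacement for token, or None if it is not listed."""
    right = pairs.get(token)  # case-sensitive, so 'tv' itself is not flagged
    if right is None:
        return None
    if sentence_start:
        right = right[:1].upper() + right[1:]
    return right
```

The lookup is deliberately case-sensitive: 'TV' is flagged, while 'tv' and a sentence-initial 'Tv' are left alone.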
>
>>
>> So I now have a dictionary to test, and to tune for replacements.
>> Is there a way to run a word list through this speller and get the
>> suggestions out?
>
> You could simply replace the file for one of the English variants and
> run LT on the command line with only the spelling rule enabled. For example,
> for British English, simply enable only MORFOLOGIK_RULE_EN_GB (the
> command-line switch is "-e MORFOLOGIK_RULE_EN_GB"). That should be the
> easiest way. And you can then compare how it worked on the same file
> with the Dutch hunspell enabled (as you don't have to touch the Dutch
> files yet).
>
> Marcin
>
>>
>> Ruud
>>
>>> On 2014-09-03 12:30, R.J. Baars wrote:
>>>> Marcin,
>>>>
>>>> For English, there are .info files in /resource/ as well as in
>>>> /resource/hunspell.
>>>> The first seems to be for the tagging dict, the second for the speller.
>>> Ah, of course, there should be one .info file per one .dict file. I
>>> thought you were asking about one dictionary file.
>>>
>>>>
>>>> (I would prefer spell-checker for directory name.)
>>>>
>>>> The content of the info file for Dutch should probably be:
>>>> fsa.dict.speller.ignore-numbers=false
>>>> fsa.dict.speller.ignore-all-uppercase=false
>>>> fsa.dict.speller.ignore-camel-case=true
>>>> fsa.dict.speller.ignore-punctuation=false
>>> Note: if you don't have all punctuation in your dictionary, this will
>>> make the speller complain on all commas, colons, hyphens etc.
>>>
>>>> fsa.dict.input-conversion=ij ij, IJ IJ
>>>
>>> You need to use normal Unicode here or Java escaping, not HTML
>>> escaping.
>>>
>>>> fsa.dict.output-conversion=ij ij, IJ IJ
>>> Do you have such characters in the dictionary file? If not, then you
>>> don't need the output conversion.
>>>
>>>> fsa.dict.speller.runon-words=false
>>>> fsa.dict.speller.locale=nl_NL
>>>> fsa.dict.speller.convert-case=false
>>>> fsa.dict.speller.ignore-diacritics=true
>>>> fsa.dict.speller.replacement-pairs=y ij, ei ij
>>>> fsa.dict.speller.equivalent-chars=
>>>> fsa.dict.frequency-included=true
>>>> fsa.dict.encoding=utf-8
>>>> fsa.dict.separator=
>>>> fsa.dict.author=R. Baars;
>>>>
>>>> I am not sure about the separator, equivalent chars and the locale.
>>> Separator is just used for internal management (usually it's a plus
>>> character). Doesn't really matter unless you want to use "+" as an
>>> entry (and you would have to if you have "ignore-punctuation" set to
>>> false).
>>>
>>>> I don't quite get the difference between diacritics, equivalent chars
>>>> and replacement pairs. Diacritics seems to me to be part of equivalent
>>>> chars and is a kind of automatic replacement.
>>> Diacritics is automatic and faster than replacement pairs. Roughly the
>>> same as equivalent chars.
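
The difference between the two mechanisms can be illustrated with a short sketch. It is a simplification under stated assumptions, not the speller's implementation: ignoring diacritics is modelled as a fixed normalisation of each character, while replacement pairs are modelled as explicit string substitutions that generate extra candidates to look up. The pairs used are the ones from the proposed .info file ("y ij, ei ij"); the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks, so 'één' and 'een' compare equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def replacement_candidates(word, pairs):
    """Return variants of word with one occurrence of 'wrong' replaced."""
    candidates = set()
    for wrong, right in pairs:
        start = word.find(wrong)
        while start != -1:
            candidates.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return candidates
```

The first function needs no per-language list, which fits the remark that diacritics handling is automatic; the second only knows the pairs it is given, so each one has to be listed explicitly.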
>>>
>>>> ei ij is a replacement, á and a are taken care of by diacritics, and
>>>> I guess Dutch does not have equivalents ...
>>>>
>>>> Right?
>>> What about apostrophes? Do you want them normalized or not?
>>>
>>> Regards,
>>> Marcin
>>>
>>>>
>>>>
>>>>
>>>>> On 2014-09-03 10:58, R.J. Baars wrote:
>>>>>> To add the word frequencies, I am directed by the wiki to an address
>>>>>> where there is indeed a frequency list. But it has only 187000 words,
>>>>>> while I have 1.2 million Dutch words and their frequencies myself.
>>>>> Probably the probabilities of their occurrence are quite low. I tried
>>>>> replacing that list with a bigger one for Polish, and my results indeed
>>>>> made the dictionary file bigger, but nothing else changed much.
>>>>>
>>>>>> The frequency is just a number; what is expected there? Is this
>>>>>> number a plain ratio, an occurrence count, or something else, like
>>>>>> logarithmic? Will I have to convert to that format, or is a plain
>>>>>> word<tab>number an option too?
>>>>> Log scale, I believe. You might want to filter out some of the lower
>>>>> results, as well, as they don't really help and only make files
>>>>> bigger.
>>>>>
>>>>> Marcin
>>>>>
>>>>>> Ruud
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Slashdot TV.
>>>>>> Video for Nerds.  Stuff that matters.
>>>>>> http://tv.slashdot.org/
>>>>>> _______________________________________________
>>>>>> Languagetool-devel mailing list
>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>>>>


