As far as I know, there is no simple replace rule for Dutch. It looks like a nice addition though, especially for difficult errors like 'stofgezogen'=>'gestofzuigd'.
I don't know what the impact is of changing from Hunspell to the Java speller. I assume the choice is not made dynamically at startup (.dic and .aff present => hunspell; a .dict present => the Java dictionary is used)? In that case, getting ahead is more difficult, since my computer is already at 100% all the time ;-)

If it is not easy to make it configurable, I would like to try switching to the Java dictionary, so that testing the speller runs in the same go as testing the rules.

Ruud

> On 2014-09-03 14:26, R.J. Baars wrote:
>>
>> Marcin,
>>
>> I filtered the frequencies for any word found more than 50 times; thus
>> 800,000 frequencies, about 4 times the size of the internet file.
>> It adds about 0.4 MB to the dictionary, now 9.7 MB in total.
>>
>> The dictionary still needs some improvement (e.g. fully uppercase words
>> longer than 5 characters are in there, which does not conform to the
>> advice of the Dutch Language Union).
>> But that is a concern for later.
>>
>> I added lower- and uppercased words, because I am not sure what
>> algorithms are used for case. If the word found is 'Fuond', and 'found'
>> is in the dictionary, I assume the default behaviour is to suggest
>> 'Found'. Accepted forms are 'found', 'Found' and 'FOUND'. (Is that
>> assumption correct?)
>
> Yes.
>
>>
>> I need some words to be accepted only in lowercase, like 'tv', which
>> only has the correct forms 'Tv' and 'tv'; 'TV' is wrong. Same for some
>> other words. (In hunspell I used the keepcase flag on those words.)
>
> Hm, I'm not sure. But you can easily put that into a separate file of
> common simple mistakes (for SimpleReplaceRule). I found maintaining such
> a file easier than trying to use the same dictionary-search method for
> suggestions. It was particularly difficult for two- and three-letter
> words, and with a SimpleReplaceRule it's just a matter of putting the
> word into the file like this:
>
> TV tv
>
> And appropriate uppercasing will be applied by the rule anyway.
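To make the idea of such a replacement file concrete, here is a minimal sketch in Python of a lookup that re-applies an initial capital to the suggestion. It is not LanguageTool's SimpleReplaceRule implementation: the function names `load_replacements` and `suggest` are hypothetical, and the whitespace-separated "wrong right" file format is an assumption based on the "TV tv" example above.

```python
def load_replacements(lines):
    """Parse 'wrong right' pairs, one per line (assumed format).

    Blank lines and lines starting with '#' are skipped.
    """
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        wrong, right = line.split(None, 1)
        table[wrong] = right
    return table


def suggest(word, table):
    """Return a replacement for `word`, or None if the word is not listed."""
    # Exact match first, so that an entry like 'TV' -> 'tv' works as written.
    if word in table:
        return table[word]
    # Otherwise look up the lowercased form and re-apply an initial capital.
    replacement = table.get(word.lower())
    if replacement is None:
        return None
    if word[:1].isupper():
        return replacement[:1].upper() + replacement[1:]
    return replacement


# Hypothetical file content, using the examples from this thread.
table = load_replacements(["TV tv", "stofgezogen gestofzuigd"])
print(suggest("TV", table))
print(suggest("Stofgezogen", table))
print(suggest("fiets", table))
```

With these two entries, 'TV' maps to 'tv', 'Stofgezogen' maps to 'Gestofzuigd', and a word that is not in the file yields no suggestion.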
>
>>
>> So I now have a dictionary to test, and to tune for replacements.
>> Is there a way to run a word list through this speller and get the
>> suggestions out?
>
> You could simply replace the file for one of the English variants and
> run LT on the command line with only the spelling rule enabled. For
> example, for British English, simply enable only MORFOLOGIK_RULE_EN_GB
> (the command-line switch is "-e MORFOLOGIK_RULE_EN_GB"). That should be
> the easiest way. And you can then compare how it worked on the same file
> with the Dutch hunspell enabled (as you don't have to touch the Dutch
> files yet).
>
> Marcin
>
>>
>> Ruud
>>
>>> On 2014-09-03 12:30, R.J. Baars wrote:
>>>> Marcin,
>>>>
>>>> For English, there are .info files in /resource/ as well as in
>>>> /resource/hunspell.
>>>> The first seems to be for the tagging dict, the second for the speller.
>>> Ah, of course, there should be one .info file per .dict file. I
>>> thought you were asking about one dictionary file.
>>>
>>>>
>>>> (I would prefer spell-checker for the directory name.)
>>>>
>>>> The content of the info file for Dutch should probably be:
>>>> fsa.dict.speller.ignore-numbers=false
>>>> fsa.dict.speller.ignore-all-uppercase=false
>>>> fsa.dict.speller.ignore-camel-case=true
>>>> fsa.dict.speller.ignore-punctuation=false
>>> Note: if you don't have all punctuation in your dictionary, this will
>>> make the speller complain about all commas, colons, hyphens etc.
>>>
>>>> fsa.dict.input-conversion=ij ij, IJ IJ
>>>
>>> You need to use normal Unicode here or Java escaping, not HTML
>>> escaping.
>>>
>>>> fsa.dict.output-conversion=ij ij, IJ IJ
>>> Do you have such characters in the dictionary file? If not, then you
>>> don't need the output conversion.
>>>
>>>> fsa.dict.speller.runon-words=false
>>>> fsa.dict.speller.locale=nl_NL
>>>> fsa.dict.speller.convert-case=false
>>>> fsa.dict.speller.ignore-diacritics=true
>>>> fsa.dict.speller.replacement-pairs=y ij, ei ij
>>>> fsa.dict.speller.equivalent-chars=
>>>> fsa.dict.frequency-included=true
>>>> fsa.dict.encoding=utf-8
>>>> fsa.dict.separator=
>>>> fsa.dict.author=R. Baars;
>>>>
>>>> I am not sure about the separator, the equivalent chars and the locale.
>>> The separator is just used for internal management (usually it's a plus
>>> character). It doesn't really matter unless you want to use "+" as an
>>> entry (and you would have to if you have "ignore-punctuation" set to
>>> false).
>>>
>>>> I don't quite get the difference between diacritics, equivalent chars
>>>> and replacement pairs. Diacritics seem to me to be part of equivalent
>>>> chars, and a kind of automatic replacement.
>>> Diacritics is automatic and faster than replacement pairs. Roughly the
>>> same as equivalent chars.
>>>
>>>> ei ij is a replacement, á and a are taken care of by diacritics,
>>>> and I guess Dutch does not have equivalents ...
>>>>
>>>> Right?
>>> What about apostrophes? Do you want them normalized or not?
>>>
>>> Regards,
>>> Marcin
>>>
>>>>
>>>>
>>>>
>>>>> On 2014-09-03 10:58, R.J. Baars wrote:
>>>>>> To add the word frequencies, I am directed by the wiki to an address
>>>>>> where there is indeed a frequency list. But it has only 187000 words,
>>>>>> while I have 1.2 million Dutch words and their frequencies myself.
>>>>> Probably the probabilities of their occurrence are quite low. I tried
>>>>> replacing that list with a bigger one for Polish, and my results
>>>>> indeed made the dictionary file bigger, but nothing else changed much.
>>>>>
>>>>>> The frequency is just a number; what is expected there? Is this
>>>>>> number a plain ratio, an occurrence count, or something else, like
>>>>>> logarithmic?
>>>>>> Will I have to convert to that format, or is a plain word<tab>number
>>>>>> an option too?
>>>>> Log scale, I believe. You might want to filter out some of the lower
>>>>> results as well, as they don't really help and only make the files
>>>>> bigger.
>>>>>
>>>>> Marcin
>>>>>
>>>>>> Ruud
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> Slashdot TV.
>>>>>> Video for Nerds. Stuff that matters.
>>>>>> http://tv.slashdot.org/
>>>>>> _______________________________________________
>>>>>> Languagetool-devel mailing list
>>>>>> Languagetool-devel@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
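
For reference, the frequency filtering discussed in this thread (keep only words found more than 50 times, starting from a plain word<tab>number list) could be sketched roughly as below. This is only an illustration in Python: the names `filter_frequencies` and `log_bucket` are hypothetical, and the log-scale mapping with 26 buckets is an arbitrary assumption, not the format the dictionary builder actually expects.

```python
import math


def filter_frequencies(lines, min_count=50):
    """Keep only words whose raw count is greater than `min_count`.

    Each input line is assumed to be 'word<TAB>count'.
    """
    kept = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, count = line.split("\t")
        count = int(count)
        if count > min_count:
            kept[word] = count
    return kept


def log_bucket(count, max_count, buckets=26):
    """Map a raw count to an integer in 0..buckets-1 on a log scale.

    Hypothetical scaling for illustration only; the bucket count is arbitrary.
    """
    if count <= 1 or max_count <= 1:
        return 0
    scaled = math.log(count) / math.log(max_count)
    return min(buckets - 1, int(scaled * buckets))


# Made-up sample data, not real corpus counts.
sample = ["de\t1000000", "fiets\t51", "zeldzaam\t50"]
kept = filter_frequencies(sample)
top = max(kept.values())
for word, count in kept.items():
    print(word, count, log_bucket(count, top))
```

In this made-up sample, 'zeldzaam' is dropped because its count is not above the threshold, while the other two words are kept and given a bucket.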