> On 2014-09-03 12:30, R.J. Baars wrote:
>> Marcin,
>>
>> For English, there are .info files in /resource/ as well as in
>> /resource/hunspell.
>> The first seems to be for the tagging dict, the second for the speller.
> Ah, of course, there should be one .info file per .dict file. I
> thought you were asking about one dictionary file.
>
>>
>> (I would prefer spell-checker for directory name.)
>>
>> The content of the info file for Dutch should probably be:
>> fsa.dict.speller.ignore-numbers=false
>> fsa.dict.speller.ignore-all-uppercase=false
>> fsa.dict.speller.ignore-camel-case=true
>> fsa.dict.speller.ignore-punctuation=false
> Note: if you don't have all punctuation in your dictionary, this will
> make the speller complain about all commas, colons, hyphens etc.
>
>> fsa.dict.input-conversion=ij ĳ, IJ Ĳ
>
> You need to use normal Unicode here or Java escaping, not HTML escaping.

This was caused by email conversion ;-)

>
>> fsa.dict.output-conversion=ĳ ij, Ĳ IJ
> Do you have such characters in the dictionary file? If not, then you
> don't need the output conversion.

I need to make sure that a word like IJmuiden (place) is never accepted as
Ijmuiden. In Hunspell, I converted every incoming ij into the ligature, and
back going out, to make that possible.

>
>> fsa.dict.speller.runon-words=false
>> fsa.dict.speller.locale=nl_NL
>> fsa.dict.speller.convert-case=false
>> fsa.dict.speller.ignore-diacritics=true
>> fsa.dict.speller.replacement-pairs=y ij, ei ij
>> fsa.dict.speller.equivalent-chars=
>> fsa.dict.frequency-included=true
>> fsa.dict.encoding=utf-8
>> fsa.dict.separator=
>> fsa.dict.author=R. Baars;
>>
>> I am not sure about separator, equivalent chars and the locale.
> Separator is just used for internal management (usually it's a plus
> character). Doesn't really matter unless you want to use "+" as an entry
> (and you would have to if you have "ignore-punctuation" set to false).
>
>> I don't quite get the difference between diacritics, equivalent chars and
>> replacement pairs. Diacritics seems to me to be part of equivalent and is
>> a kind of automatic replacement.
> Diacritics is automatic and faster than replacement pairs. Roughly the
> same as equivalent chars.
>
>> ei ij is a replacement, á and a are taken care of by diacritics, and I
>> guess Dutch does not have equivalents ...
>>
>> Right?
> What about apostrophes? Do you want them normalized or not?

Yes I guess I would ...

>
> Regards,
> Marcin
>
>>
>>
>>
>>> On 2014-09-03 10:58, R.J. Baars wrote:
>>>> To add the word frequencies, I am directed by the wiki to an address
>>>> where there is a frequency list indeed. But only 187000 words; while I
>>>> have 1.2 million Dutch words and their frequency myself.
>>> Probably the probabilities of their occurrence are quite low.
>>> I tried replacing that list with a bigger one for Polish and my results
>>> indeed made the dictionary file bigger but nothing else changed much.
>>>
>>>> The frequency is just a number; what is expected there? Is this number
>>>> a plain ratio, an occurrence count, or something else, like logarithmic?
>>>> Will I have to convert to that format, or is a plain word<tab>number
>>>> an option too?
>>> Log scale, I believe. You might want to filter out some of the lower
>>> results, as well, as they don't really help and only make files bigger.
>>>
>>> Marcin
>>>
>>>> Ruud
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Slashdot TV.
>>>> Video for Nerds. Stuff that matters.
>>>> http://tv.slashdot.org/
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> Languagetool-devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
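
The frequency question in the thread (plain occurrence counts versus a log scale, plus dropping the rarest words) can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the format the dictionary builder actually requires: the function names, the default of 26 buckets and the `min_count` cut-off are all hypothetical, and the input is assumed to be plain `word<tab>count` lines.

```python
import math


def parse_frequency_lines(lines):
    """Parse 'word<tab>count' lines into a {word: count} dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, number = line.partition("\t")
        counts[word] = int(number)
    return counts


def log_scale_frequencies(counts, buckets=26, min_count=2):
    """Map raw counts to integer buckets 0..buckets-1 on a log scale.

    Both `buckets` and `min_count` are assumptions made for this sketch.
    Words seen fewer than `min_count` times are dropped, since very rare
    entries mostly make the file bigger.
    """
    kept = {w: c for w, c in counts.items() if c >= min_count}
    if not kept:
        return {}
    lo = math.log(min(kept.values()))
    hi = math.log(max(kept.values()))
    span = hi - lo
    scaled = {}
    for word, count in kept.items():
        if span == 0:
            # All remaining words are equally frequent.
            scaled[word] = buckets - 1
        else:
            position = (math.log(count) - lo) / span  # 0.0 .. 1.0
            scaled[word] = min(buckets - 1, int(position * buckets))
    return scaled


sample = ["de\t1000000", "fiets\t1000", "IJmuiden\t10", "zeldzaam\t1"]
result = log_scale_frequencies(parse_frequency_lines(sample))
```

With a list of 1.2 million words, raising `min_count` is the simplest way to try the filtering suggested in the thread and see how much the resulting file shrinks.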