Hi, Hunspell 1.0.9 is ready to integrate with OpenOffice.org!
(https://sourceforge.net/project/showfiles.php?group_id=143754)

Main changes:

- Added OOo makefile.in and improved the UNO module.

- Myspell 3.0 has a wonderful ngram suggestion algorithm implemented by
  Kevin Hendricks, but it could be improved in certain cases. Now it
  works with Unicode encoding and it has much better support for
  languages with rich morphology (with lots of affixes).

- A lot of bug fixes and improvements (see release notes and ChangeLog).

- Successfully checked Hunspell on all OpenOffice.org dictionaries with
  the improved unmunch. For dictionary developers: this detected a lot
  of bugs in the OOo dictionaries (see release notes).

Bram Moolenaar, author of the excellent Vim editor, reviewed Hunspell's
code and suggested a lot of improvements for future versions of
Hunspell.

Many thanks to Kevin, Bram and other contributors!

Laci

Release notes
-------------

* improved related character map suggestion

* improved ngram suggestion

------ examples for ngram improvement (O = old, N = new ngram suggestions) --

1. Permenant (instead of Permanent)

   O: Endangerment, Ferment, Fermented, Deferment's, Empowerment,
      Ferment's, Ferments, Fermenting, Countermen, Weathermen
   N: Permanent, Supermen, Preferment

   Note: The old ngram suggestions were case sensitive.

2. permenant (instead of permanent)

   O: supermen, newspapermen, empowerment, endangerment, preferments,
      preferment, permanent, preferment's, permanently, impermanent
   N: permanent, supermen, preferment

   Note: new suggestions are also weighted with the longest common
   subsequence, first letter and common character positions.

3. pernemant (instead of permanent)

   O: pimpernel's, pimpernel, pimpernels, permanently, permanents,
      permanent, supernatant, impermanent, semipermanent, impermanently
   N: permanent, supernatant, pimpernel

   Note: the new method also prefers the root word to irrelevant
   affixed forms ('s, s and ly).

4. pernament (instead of permanent)

   O: tournaments, tournament, ornaments, ornament's, ornamenting,
      ornamented, ornament, ornamentals, ornamental, ornamentally
   N: ornamental, ornament, tournament

   Note: Both ngram methods miss here.

5. obvus (instead of obvious)

   O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse,
      obviates, obviate, Travus
   N: obvious, obtuse, obverse

   Note: the new method also prefers common first letters.

6. unambigus (instead of unambiguous)

   O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous,
      unambitious, ambiguities, ambiguousness
   N: unambiguous, unambiguity, unambitious

7. consecvence (instead of consequence)

   O: consecutive, consecutively, consecutiveness, nonconsecutive,
      consequence, consecutiveness's, convenience's, consistences,
      consistence
   N: consequence, consecutive, consecrates

An example in a language with rich morphology:

8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian])

   O: Misikédéiben, Pisisedéiben, Misikéiéiben, Pisisekéiben,
      Misikéiben, Misikéidéiben, Misikékéiben, Misikéikéiben,
      Misikéiméiben, Mississippiiben
   N: Mississippiben, Mississippiiben, Misiiben

   Note: Suggesting irrelevant affixed forms was the biggest fault of
   the ngram suggestion for languages with a lot of affixes.

--------------- end of examples --------------------

* support twofold prefix cutting

* lots of other improvements and bug fixes (see ChangeLog)

* test Hunspell with 54 OpenOffice.org dictionaries:

  source:
  ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries

  testing shell script:
-------------------------------------------------------
# process every dictionary archive named like xx_YY.zip
for i in `ls *zip | grep '^[a-z]*_[A-Z]*[.]'`
do
    dic=`basename $i .zip`
    mkdir $dic
    echo unzip $dic
    unzip -d $dic $i 2>/dev/null
    cd $dic
    echo unmunch and test $dic
    # expand all affixed forms, then list the words Hunspell rejects;
    # an empty .result file means every generated word was accepted
    unmunch $dic.dic $dic.aff 2>/dev/null | awk '{print$0"\t"}' |
    hunspell -d $dic -l -1 >$dic.result 2>$dic.err || rm -f $dic.result
    cd ..
done
--------------------------------------------------------

  test result (0 size is o.k.):

$ for i in *_*/*.result; do wc -c $i; done
0 af_ZA/af_ZA.result
0 bg_BG/bg_BG.result
0 ca_ES/ca_ES.result
0 cy_GB/cy_GB.result
0 cs_CZ/cs_CZ.result
0 da_DK/da_DK.result
0 de_AT/de_AT.result
0 de_CH/de_CH.result
0 de_DE/de_DE.result
0 el_GR/el_GR.result
6 en_AU/en_AU.result
0 en_CA/en_CA.result
0 en_GB/en_GB.result
0 en_NZ/en_NZ.result
0 en_US/en_US.result
0 eo_EO/eo_EO.result
0 es_ES/es_ES.result
0 es_MX/es_MX.result
0 es_NEW/es_NEW.result
0 fo_FO/fo_FO.result
0 fr_FR/fr_FR.result
0 ga_IE/ga_IE.result
0 gd_GB/gd_GB.result
0 gl_ES/gl_ES.result
0 he_IL/he_IL.result
0 hr_HR/hr_HR.result
200694989 hu_HU/hu_HU.result
0 id_ID/id_ID.result
0 it_IT/it_IT.result
0 ku_TR/ku_TR.result
0 lt_LT/lt_LT.result
0 lv_LV/lv_LV.result
0 mg_MG/mg_MG.result
0 mi_NZ/mi_NZ.result
0 ms_MY/ms_MY.result
0 nb_NO/nb_NO.result
0 nl_NL/nl_NL.result
0 nn_NO/nn_NO.result
0 ny_MW/ny_MW.result
0 pl_PL/pl_PL.result
0 pt_BR/pt_BR.result
0 pt_PT/pt_PT.result
0 ro_RO/ro_RO.result
0 ru_RU/ru_RU.result
0 rw_RW/rw_RW.result
0 sk_SK/sk_SK.result
0 sl_SI/sl_SI.result
0 sv_SE/sv_SE.result
0 sw_KE/sw_KE.result
0 tet_ID/tet_ID.result
0 tl_PH/tl_PH.result
0 tn_ZA/tn_ZA.result
0 uk_UA/uk_UA.result
0 zu_ZA/zu_ZA.result

  In the en_AU dictionary there is an abbreviation with two dots
  (`eqn..'), but `eqn.' is missing. Presumably it is a dictionary bug.
  Myspell did not accept it either.

  The Hungarian dictionary contains pseudoroots and forbidden words.
  Unmunch does not support these features yet, so it generates bad
  words, too.

* check affix rules and OOo dictionaries. Detected bugs in the cs_CZ,
  es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE
  dictionaries.
Details:
--------------------------------------------------------
cs_CZ
  warning - incompatible stripping characters and condition:
    SFX D us ech [^ighk]os
    SFX D us y [^i]os
    SFX Q os ech [^ghk]es
    SFX M o ech [^ghkei]a
    SFX J ém ej ám
    SFX J ém ejme ám
    SFX J ém ejte ám
    SFX A oužit up oupit
    SFX A oužit upme oupit
    SFX A oužit upte oupit
    SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy
    SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy

es_ES
  warning - incompatible stripping characters and condition:
    SFX W umar úse [ae]husar
    SFX W emir ińáis eńir

es_NEW
  warning - incompatible stripping characters and condition:
    SFX I unan únen unar

es_MX
  warning - incompatible stripping characters and condition:
    SFX A a ote e
    SFX W umar úse [ae]husar
    SFX W emir ińáis eńir

lt_LT
  warning - incompatible stripping characters and condition:
    SFX U ti siuosi tis
    SFX U ti siuosi tis
    SFX U ti siesi tis
    SFX U ti siesi tis
    SFX U ti sis tis
    SFX U ti sis tis
    SFX U ti simës tis
    SFX U ti simës tis
    SFX U ti sitës tis
    SFX U ti sitës tis

nn_NO
  warning - incompatible stripping characters and condition:
    SFX D ar rar [^fmk]er
    SFX U Řre orde ere
    SFX U Řre ort ere

pt_PT
  warning - incompatible stripping characters and condition:
    SFX g ăos oas ăo
    SFX g ăos oas ăo

ro_RO
  warning - bad field number:
    SFX L 0 le [^cg] i
    SFX L 0 i [cg] i
    SFX U 0 i [^i] ii
  warning - incompatible stripping characters and condition:
    SFX P l i l     (<- there is an unnecessary tab character here)
    SFX I a ii [gc] a
  warning - bad field number:
    SFX I a ii [gc] a
    SFX I a ei [^cg] a

sk_SK
  warning - incompatible stripping characters and condition:
    SFX T ľať olú klať
    SFX T ľať olúc klať
    SFX T sľať šlú slať
    SFX T sľať šlúc slať
    SFX R ľcť lčiem ĺcť
    SFX R iásť ätie miasť
    SFX R iezť iem [^i]ezť
    SFX R iezť ieš [^i]ezť
    SFX R iezť ie [^i]ezť
    SFX R iezť eme [^i]ezť
    SFX R iezť ete [^i]ezť
    SFX R iezť ú [^i]ezť
    SFX R iezť úc [^i]ezť
    SFX R iezť z [^i]ezť
    SFX R iezť me [^i]ezť
    SFX R iezť te [^i]ezť

sv_SE
  warning - bad field number:
    SFX C 0 net nets [^e]n
--------------------------------------------------------

ChangeLog
---------

improvements:

* src/hunspell/suggestmgr.cxx: Unicode support in related character
  map suggestion
* src/hunspell/suggestmgr.cxx: Unicode support in ngram suggestion
* src/hunspell/{suggestmgr,affixmgr,hunspell}.cxx: improve ngram
  suggestion.
  Fix http://qa.openoffice.org/issues/show_bug.cgi?id=35725.
  See release notes for examples. This problem was reported by
  beccablain at OpenOffice.org.
  - ngram suggestions are now case insensitive (see `Permenant' bug
    in Issuezilla)
  - weight ngram suggestions (with the longest common subsequence
    algorithm, also considering the lengths of the bad word and the
    suggestion, identical first letters and almost completely
    identical character positions)
  - set strict affix congruency in expand_rootword(). Now ngram
    suggestions are good for languages with rich morphology and also
    better for English.
    Rationale: affixed forms of the first ngram suggestion very often
    suppress the second and subsequent root word suggestions. But
    faults in affixes are less common, and can be fixed without
    suggestions. We must prefer the more informative second and
    subsequent root word suggestions to the suggestions for bad
    affixes.
  - a better suggestion must not be a substring of a weaker suggestion
    Rationale: Suggesting affixed forms of a root word is unnecessary
    when the root word has a better weighted ngram value. (Checking
    substrings is a good approximation for this refinement.)
  - fewer ngram suggestions (default maximum is 3 instead of 10)
    Rationale: Users need a big extra effort to check a lot of bad
    ngram suggestions, nine times out of ten unnecessarily. It is very
    distracting, because ngram suggestions can be very different from
    each other. Myspell and Hunspell usually give one or two
    suggestions with the old suggestion algorithms (the maximum is
    15), while the ngram algorithm often gives the maximum number of
    suggestions. With strict affix congruency and the other
    refinements, the good suggestion is usually among the first three
    elements.
  - new affix parameter: MAXNGRAMSUG
* src/hunspell/*: support agglutinative languages with rich prefix
  morphology or with right-to-left writing system (for example, Turkic
  and Austronesian languages with (modified) Arabic scripts).
  - new affix parameter: COMPLEXPREFIXES
    Set twofold prefix stripping (but single suffix stripping)
* src/hunspell/affixmgr.cxx:
  - speed up prefix loading with tree sorting algorithm.
* tests/complexprefixes.*, tests/complexprefixesutf.*:
  Coptic example posted by Moheb Mekhaiel
* src/hunspell/hashmgr.cxx: check size attribute in dic file;
  suggested by Daniel Naber
  Rationale: When the size attribute is missing, Hunspell allocates
  too small, and therefore slower, hash memory, and Hunspell can lose
  the first dictionary word.
* src/hunspell/affixmgr.cxx: check stripping characters and condition
  compatibility in affix rules (bugs detected in cs_CZ, es_ES, es_NEW,
  es_MX, lt_LT, nn_NO, pt_PT, ro_RO and sk_SK dictionaries).
  See release notes of Hunspell 1.0.9 in NEWS.
* src/hunspell/affixmgr.cxx: check unnecessary fields in affix rules
  (bugs detected in ro_RO and sv_SE dictionaries). See release notes.
* src/hunspell/affixmgr.cxx: remove redundant condition checking in
  affix rules with stripping characters (redundancy in OpenOffice.org
  dictionaries reported by Eleonóra Goldman)
  Rationale: this is a small optimization, but it was excellent for
  detecting bad ngram affixation caused by bad or weak affix
  conditions.
* tests/germancompounding.aff: improve compound definition
  - use dash prefix instead of language specific tokenizer
    Rationale: Using a uniform approach is the right way to check and
    analyze compound words. Language specific word breaking is
    deprecated; word-like word pairs need sophisticated grammar
    checking (for example, in Hungarian there is a substandard but
    accepted syntax with a dash for word pairs: cats, dogs ->
    kutyák-macskák, like cats/dogs in English).
* test Hunspell with 54 OpenOffice.org dictionaries: see release notes

bug fixes:

* src/hunspell/suggestmgr.*: add time limit to the exponential
  algorithm of the related character map suggestion
  Rationale: a long word in agglutinative languages or a special
  pattern (for example a horizontal rule) made of map characters can
  `crash' the spell checker.
* src/hunspell/affentry.cxx: add() functions: fix bad word generation
  by checking stripping characters (see similar bug in unmunch)
* src/hunspell/affixmgr.cxx: parse_file(): fix unconditional getNext()
  call for ~AffixMgr() when affix file is corrupt.
* src/hunspell/affixmgr.*: AffixMgr(), parse_cpdsyllable(): fix
  missing string duplications for ~AffixMgr() when affix file is
  corrupt.
* src/hunspell/affixmgr.*: parse_affix(): fix fprintf() call when
  affix file is corrupt. Bug reported by Daniel Naber.
* suggestmgr.cxx: replace single usage of 'strdup' with 'mystrdup';
  patch by Chris Halls (debian.org)
* src/hunspell/makefile.mk: add makefile.mk for compiling in
  OpenOffice.org. See README in the Hunspell UNO module.
  Problems with separated compiling reported by Rene Engelhard
* src/hunspell/hunspell.cxx: fix pseudoroot support
  - search for a non-pseudoroot homonym in check()
* tests/pseudoroot4.*: test this fix
* src/tools/unmunch.c: fix bad word generation when conditions are
  shorter than or incompatible with stripping characters in affix
  rules
* src/tools/unmunch.c: fix mychomp() for de_AT.dic and other dic files
  without a final newline character.

other changes:

* src/hunspell/suggestmgr.*: erase ACCENT suggestion
  Rationale: ACCENT suggestion was the same as Kevin Hendricks's map
  suggestion algorithm, but with a less good interface in the affix
  file.
* src/hunspell/suggestmgr.*: combine the cycle number limit in
  badchar() and forgotchar() with a time limit.
* src/hunspell/affixmgr.*: remove NOMAPSUGS affix parameter
* src/hunspell/{suggestmgr,hunspell}.*: strip periods from suggestions
  (restore MySpell's original behaviour)
  Rationale: OpenOffice.org has an automatic period handling mechanism
  and suggestions look better without periods.
  - new affix file parameter: SUGSWITHDOTS
    Add period(s) to suggestions, if the input word terminates in
    period(s). (Not needed for OpenOffice.org dictionaries.)
* tests/germancompounding.aff: fix bad German affix in affix example
  (computeren->computern). Suggested by Daniel Naber.
* src/tools/example.cxx: add Myspell's example
* src/tools/munch.cxx: add Myspell's munch
* man{,/hu}/hunspell.4: refresh manual pages
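
The `incompatible stripping characters and condition' warnings in the release notes may be easier to follow with a small sketch. What follows is not the code in affixmgr.cxx, only a rough Python approximation of the idea for suffix rules: the stripped characters are the end of the word, so they have to agree with the end of the condition, otherwise the rule can never apply. The function names are made up for this illustration, and the condition syntax is assumed to be limited to literal characters, `.', `[abc]' and `[^abc]'.

```python
def parse_condition(cond):
    """Split an affix condition into elements.

    Each element is None for '.' (any character) or a pair
    (negated, chars) for a literal character or a [...] class.
    Raises ValueError on an unterminated '[' class.
    """
    elems = []
    i = 0
    while i < len(cond):
        ch = cond[i]
        if ch == "[":
            end = cond.index("]", i)
            body = cond[i + 1:end]
            negated = body.startswith("^")
            elems.append((negated, set(body[1:] if negated else body)))
            i = end + 1
        elif ch == ".":
            elems.append(None)
            i += 1
        else:
            elems.append((False, {ch}))
            i += 1
    return elems


def element_accepts(elem, ch):
    """True if one condition element can match the character ch."""
    if elem is None:
        return True
    negated, chars = elem
    return (ch in chars) != negated


def suffix_rule_is_compatible(strip, cond):
    """Check the tail of the condition against the stripped characters.

    A strip field of '0' means nothing is stripped, so there is
    nothing to compare. Only the overlapping tail is compared when
    the two fields differ in length.
    """
    if strip == "0":
        return True
    elems = parse_condition(cond)
    n = min(len(strip), len(elems))
    if n == 0:
        return True
    return all(element_accepts(e, c)
               for e, c in zip(elems[-n:], strip[-n:]))


# Two rules from the cs_CZ and es_ES reports above, and one made-up
# rule whose fields agree:
print(suffix_rule_is_compatible("us", "[^ighk]os"))    # False
print(suffix_rule_is_compatible("umar", "[ae]husar"))  # False
print(suffix_rule_is_compatible("y", "[^aeiou]y"))     # True
```

For prefix rules the same comparison would presumably run from the start of the condition instead of the end; that case is left out here.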
