[lingu-dev] Hunspell 1.0.9

nemeth Fri, 26 Aug 2005 08:23:14 -0700

Hi,

Hunspell 1.0.9 is ready to integrate with OpenOffice.org!


(https://sourceforge.net/project/showfiles.php?group_id=143754)

Main changes:

- Added OOo makefile.in and improved UNO module

- Myspell 3.0 has a wonderful ngram suggestion algorithm implemented
by Kevin Hendricks, but it had room for improvement in certain cases.
It now works with Unicode encoding and has much better support for
languages with rich morphology (with lots of affixes).

- a lot of bug fixes and improvements (see release notes and changelog)

- successfully checked Hunspell on all OpenOffice.org dictionaries with
  improved unmunch. For dictionary developers: the check detected a lot of
  bugs in OOo dictionaries (see release notes).

Bram Moolenaar, author of the excellent Vim editor, reviewed
Hunspell's code, and he has suggested a lot of improvements
for future versions of Hunspell.

Many thanks to Kevin, Bram and other contributors!

Laci

Release notes
-------------

* improved related character map suggestion

* improved ngram suggestion

------ examples for ngram improvement (O=old, N = new ngram suggestions) --

1. Permenant (instead of Permanent)

O: Endangerment, Ferment, Fermented, Deferment's, Empowerment,
        Ferment's, Ferments, Fermenting, Countermen, Weathermen

N: Permanent, Supermen, Preferment

Note: the old ngram suggestions were case sensitive.

2. permenant (instead of permanent)

O: supermen, newspapermen, empowerment, endangerment, preferments,
        preferment, permanent, preferment's, permanently, impermanent

N: permanent, supermen, preferment

Note: new suggestions are also weighted with longest common subsequence,
first letter and common character positions

3. pernemant (instead of permanent)

O: pimpernel's, pimpernel, pimpernels, permanently, permanents, permanent,
        supernatant, impermanent, semipermanent, impermanently

N: permanent, supernatant, pimpernel

Note: the new method also prefers root words over forms with
irrelevant affixes ('s, s and ly)


4. pernament (instead of permanent)

O: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
        ornament, ornamentals, ornamental, ornamentally

N: ornamental, ornament, tournament

Note: Both ngram methods miss here.


5. obvus (instead of obvious)

O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse,
        obviates, obviate, Travus

N: obvious, obtuse, obverse

Note: new method also prefers common first letters.


6. unambigus (instead of unambiguous)

O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous,
        unambitious, ambiguities, ambiguousness

N: unambiguous, unambiguity, unambitious



7. consecvence (instead of consequence)

O: consecutive, consecutively, consecutiveness, nonconsecutive, consequence,
        consecutiveness's, convenience's, consistences, consistence

N: consequence, consecutive, consecrates


An example in a language with rich morphology:

8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian]):

O: Misikédéiben, Pisisedéiben, Misikéiéiben, Pisisekéiben, Misikéiben,
        Misikéidéiben, Misikékéiben, Misikéikéiben, Misikéiméiben,
        Mississippiiben

N: Mississippiben, Mississippiiben, Misiiben

Note: Suggesting forms with irrelevant affixes was the biggest fault of the
   old ngram suggestion for languages with a lot of affixes.

--------------- end of examples --------------------
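
The notes above say that the new ranking combines ngram similarity with the
longest common subsequence, the first letter and common character positions.
A minimal Python sketch of that idea follows. The function names and all the
weights are assumptions made for illustration; they are not taken from
suggestmgr.cxx.

```python
def ngram_score(n, a, b):
    """Count the substrings of a (length 1..n) that also occur in b."""
    score = 0
    for size in range(1, n + 1):
        for i in range(len(a) - size + 1):
            if a[i:i + size] in b:
                score += 1
    return score


def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def common_positions(a, b):
    """Number of positions where both words have the same character."""
    return sum(1 for ca, cb in zip(a, b) if ca == cb)


def weighted_score(bad, cand):
    """Hypothetical weighting: the factors are illustrative, not Hunspell's."""
    bad, cand = bad.lower(), cand.lower()      # case-insensitive comparison
    score = ngram_score(3, bad, cand)
    score += 2 * lcs_len(bad, cand)
    score -= abs(len(bad) - len(cand))         # penalize length differences
    score += common_positions(bad, cand)
    if bad[:1] == cand[:1]:
        score += 2                             # identical first letter
    return score


candidates = ["Endangerment", "Ferment", "Permanent", "Supermen"]
ranked = sorted(candidates, key=lambda w: weighted_score("Permenant", w),
                reverse=True)
print(ranked[0])
```

With these assumed weights the candidate that shares the first letter, the
length and most character positions with the bad word moves to the front,
which is the effect shown in examples 1 and 2.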

* support twofold prefix cutting

* lots of other improvements and bug fixes (see ChangeLog)

* test Hunspell with 54 OpenOffice.org dictionaries:

source:
ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries

testing shell script:
-------------------------------------------------------
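# Test every dictionary archive named like xx_YY.zip in the current directory.
for i in $(ls *zip | grep '^[a-z]*_[A-Z]*[.]')
do
        dic=$(basename "$i" .zip)
        mkdir -p "$dic"
        echo "unzip $dic"
        unzip -d "$dic" "$i" 2>/dev/null
        cd "$dic" || continue
        echo "unmunch and test $dic"
        # Expand all word forms, append a tab so that -1 checks only the first
        # field, and list the forms that hunspell rejects (-l).
        unmunch "$dic.dic" "$dic.aff" 2>/dev/null | awk '{print $0"\t"}' |
        hunspell -d "$dic" -l -1 >"$dic.result" 2>"$dic.err" || rm -f "$dic.result"
        cd ..
done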
--------------------------------------------------------

test result (0 size is o.k.):

$ for i in *_*/*.result; do wc -c $i; done
0 af_ZA/af_ZA.result
0 bg_BG/bg_BG.result
0 ca_ES/ca_ES.result
0 cy_GB/cy_GB.result
0 cs_CZ/cs_CZ.result
0 da_DK/da_DK.result
0 de_AT/de_AT.result
0 de_CH/de_CH.result
0 de_DE/de_DE.result
0 el_GR/el_GR.result
6 en_AU/en_AU.result
0 en_CA/en_CA.result
0 en_GB/en_GB.result
0 en_NZ/en_NZ.result
0 en_US/en_US.result
0 eo_EO/eo_EO.result
0 es_ES/es_ES.result
0 es_MX/es_MX.result
0 es_NEW/es_NEW.result
0 fo_FO/fo_FO.result
0 fr_FR/fr_FR.result
0 ga_IE/ga_IE.result
0 gd_GB/gd_GB.result
0 gl_ES/gl_ES.result
0 he_IL/he_IL.result
0 hr_HR/hr_HR.result
200694989 hu_HU/hu_HU.result
0 id_ID/id_ID.result
0 it_IT/it_IT.result
0 ku_TR/ku_TR.result
0 lt_LT/lt_LT.result
0 lv_LV/lv_LV.result
0 mg_MG/mg_MG.result
0 mi_NZ/mi_NZ.result
0 ms_MY/ms_MY.result
0 nb_NO/nb_NO.result
0 nl_NL/nl_NL.result
0 nn_NO/nn_NO.result
0 ny_MW/ny_MW.result
0 pl_PL/pl_PL.result
0 pt_BR/pt_BR.result
0 pt_PT/pt_PT.result
0 ro_RO/ro_RO.result
0 ru_RU/ru_RU.result
0 rw_RW/rw_RW.result
0 sk_SK/sk_SK.result
0 sl_SI/sl_SI.result
0 sv_SE/sv_SE.result
0 sw_KE/sw_KE.result
0 tet_ID/tet_ID.result
0 tl_PH/tl_PH.result
0 tn_ZA/tn_ZA.result
0 uk_UA/uk_UA.result
0 zu_ZA/zu_ZA.result

In the en_AU dictionary, there is an abbreviation with two dots (`eqn..'), but
`eqn.' is missing. Presumably it is a dictionary bug. Myspell did not
accept it either.

The Hungarian dictionary contains pseudoroots and forbidden words.
Unmunch does not support these features yet, so it also generates bad words.

* check affix rules and OOo dictionaries (detected bugs in the cs_CZ,
es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE dictionaries).

Details:
--------------------------------------------------------
cs_CZ
warning - incompatible stripping characters and condition:
SFX D   us          ech        [^ighk]os
SFX D   us          y          [^i]os
SFX Q   os          ech        [^ghk]es
SFX M   o           ech        [^ghkei]a
SFX J   ém          ej         ám
SFX J   ém          ejme       ám
SFX J   ém          ejte       ám
SFX A   oužit       up         oupit
SFX A   oužit       upme       oupit
SFX A   oužit       upte       oupit
SFX A   nout        l          [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy
SFX A   nout        l          [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy

es_ES
warning - incompatible stripping characters and condition:
SFX W umar úse [ae]husar
SFX W emir iñáis eñir

es_NEW
warning - incompatible stripping characters and condition:
SFX I unan únen unar

es_MX
warning - incompatible stripping characters and condition:
SFX A a ote e
SFX W umar úse [ae]husar
SFX W emir iñáis eñir

lt_LT
warning - incompatible stripping characters and condition:
SFX U ti      siuosi          tis
SFX U ti      siuosi          tis
SFX U ti      siesi           tis
SFX U ti      siesi           tis
SFX U ti      sis             tis
SFX U ti      sis             tis
SFX U ti      simės           tis
SFX U ti      simės           tis
SFX U ti      sitės           tis
SFX U ti      sitės           tis

nn_NO
warning - incompatible stripping characters and condition:
SFX D   ar  rar  [^fmk]er
SFX U   Øre  orde  ere
SFX U   Øre  ort  ere

pt_PT
warning - incompatible stripping characters and condition:
SFX g   ãos        oas        ão
SFX g   ãos        oas        ão

ro_RO
warning - bad field number:
SFX L   0          le         [^cg] i
SFX L   0          i          [cg] i
SFX U   0          i          [^i] ii
warning - incompatible stripping characters and condition:
SFX P   l          i          l (<- there is an unnecessary tabulator here)
SFX I   a          ii         [gc] a
warning - bad field number:
SFX I   a          ii         [gc] a
SFX I   a          ei         [^cg] a

sk_SK
warning - incompatible stripping characters and condition:
SFX T   ľať         olú        klať
SFX T   ľať         olúc       klať
SFX T   sľať        šlú        slať
SFX T   sľať        šlúc       slať
SFX R   ľcť         lčiem      ĺcť
SFX R   iásť        ätie       miasť
SFX R   iezť        iem        [^i]ezť
SFX R   iezť        ieš        [^i]ezť
SFX R   iezť        ie         [^i]ezť
SFX R   iezť        eme        [^i]ezť
SFX R   iezť        ete        [^i]ezť
SFX R   iezť        ú          [^i]ezť
SFX R   iezť        úc         [^i]ezť
SFX R   iezť        z          [^i]ezť
SFX R   iezť        me         [^i]ezť
SFX R   iezť        te         [^i]ezť

sv_SE
warning - bad field number:
SFX  C  0  net  nets [^e]n
--------------------------------------------------------
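
The two kinds of warnings listed above can be reproduced with a small check
on each SFX line. The following Python sketch is only an illustration under
stated assumptions: the function names are hypothetical, a rule is expected
to have exactly five whitespace-separated fields (SFX, flag, stripping
characters, affix, condition), and the comparison logic is a guess at what
"incompatible" means, not the code in affixmgr.cxx.

```python
def parse_condition(cond):
    """Split a condition like '[^ghk]es' into one matcher per character."""
    positions, i = [], 0
    while i < len(cond):
        if cond[i] == "[":
            end = cond.find("]", i)
            if end < 0:                  # unterminated class: read to the end
                end = len(cond)
            body = cond[i + 1:end]
            negated = body.startswith("^")
            positions.append((negated, set(body[1:] if negated else body)))
            i = end + 1
        else:
            # '.' matches any character; anything else is a literal
            positions.append(None if cond[i] == "." else (False, {cond[i]}))
            i += 1
    return positions


def accepts(position, ch):
    """True if the condition position allows the character ch."""
    if position is None:
        return True
    negated, chars = position
    return (ch in chars) != negated


def check_sfx_rule(line):
    """Return a list of warnings for one suffix rule line."""
    fields = line.split()
    if len(fields) != 5 or fields[0] != "SFX":
        return ["bad field number"]
    strip, cond = fields[2], fields[4]
    if strip == "0":                     # nothing is stripped, nothing to compare
        return []
    positions = parse_condition(cond)
    # The stripped characters are the end of the word, so they must be
    # accepted by the end of the condition, compared right to left.
    for pos, ch in zip(reversed(positions), reversed(strip)):
        if not accepts(pos, ch):
            return ["incompatible stripping characters and condition"]
    return []


print(check_sfx_rule("SFX D   us          ech        [^ighk]os"))
print(check_sfx_rule("SFX  C  0  net  nets [^e]n"))
```

In the first rule the word must end in "os" but the rule strips "us", so the
rule can never produce a correct form. In the second rule there is one field
too many.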


ChangeLog
---------

improvements:

        * src/hunspell/suggestmgr.cxx:
          Unicode support in related character map suggestion

        * src/hunspell/suggestmgr.cxx: Unicode support in ngram suggestion

        * src/hunspell/{suggestmgr,affixmgr,hunspell}.cxx: improve ngram
          suggestion.
          Fix http://qa.openoffice.org/issues/show_bug.cgi?id=35725. See release
          notes for examples. This problem was reported by beccablain at
          OpenOffice.org.
        - ngram suggestions are now case insensitive (see `Permenant' bug in
          Issuezilla)
        - weight ngram suggestions (with the longest common subsequence
          algorithm, also considering lengths of bad word and suggestion,
          identical first letters and almost completely identical character
          positions)
        - set strict affix congruency in expand_rootword(). Now ngram
          suggestions are good for languages with rich morphology and also
          better for English.
          Rationale: affixed forms of the first ngram suggestion
          very often suppress the second and subsequent root word suggestions.
          But faults in affixes are less common, and can be fixed without
          suggestions. We must prefer the more informative second and
          subsequent root word suggestions to the suggestions for bad affixes.
        - a better suggestion may not be a substring of a less good suggestion
          Rationale: Suggesting affixed forms of a root word is
          unnecessary when the root word has a better weighted ngram value.
          (Checking substrings is a good approximation for this refinement.)
        - fewer ngram suggestions (default 3 maximum instead of 10)
          Rationale: Users need a big extra effort to check a lot of bad
          ngram suggestions, nine times out of ten unnecessarily. It is very
          distracting, because ngram suggestions can be very different.
          Usually Myspell and Hunspell give one or two suggestions with
          the old suggestion algorithms (maximum is 15), but the ngram
          algorithm often gives the maximum number of suggestions. With strict
          affix congruency and other refinements, the good suggestion is
          usually among the first three elements.
        - new affix parameter: MAXNGRAMSUG
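
The substring rule and the lower suggestion limit described above can be
sketched in a few lines of Python. The function name, the parameter name and
the ranked input list are assumptions for illustration only; the real logic
lives in suggestmgr.cxx and may differ.

```python
def filter_suggestions(ranked, max_ngram_sug=3):
    """ranked: candidate words sorted from best to worst weighted score."""
    kept = []
    for word in ranked:
        # Skip a word that contains an already kept (better) suggestion:
        # it is most likely just an affixed form of that suggestion.
        if any(better in word for better in kept):
            continue
        kept.append(word)
        if len(kept) == max_ngram_sug:
            break
    return kept


# Hypothetical ranking for the bad word "permenant"
ranked = ["permanent", "permanently", "permanents", "supermen",
          "impermanent", "preferment", "preferments"]
print(filter_suggestions(ranked))
```

Forms such as "permanently" and "impermanent" are dropped because they
contain the better suggestion "permanent", which leaves room for other root
words within the limit of three.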

        * src/hunspell/*: support agglutinative languages with rich prefix
          morphology or with right-to-left writing system (for example, Turkic
          and Austronesian languages with (modified) Arabic scripts).
        - new affix parameter: COMPLEXPREFIXES
          Set twofold prefix stripping (but single suffix stripping)
        * src/hunspell/affixmgr.cxx:
        - speed up prefix loading with tree sorting algorithm.
        * tests/complexprefixes.*, tests/complexprefixesutf.*:
          Coptic example posted by Moheb Mekhaiel

        * src/hunspell/hashmgr.cxx: check size attribute in dic file
          suggested by Daniel Naber
          Rationale: With a missing size attribute Hunspell allocates a hash
          table that is too small and much slower, and Hunspell can lose the
          first dictionary word.
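
A minimal Python sketch of such a check, assuming that the size attribute is
a plain number on the first line of the dic file. The function name and the
warning text are hypothetical and do not come from hashmgr.cxx.

```python
def read_dic_words(lines):
    """Return (words, warnings) for the lines of a .dic file."""
    warnings = []
    lines = [ln.rstrip("\r\n") for ln in lines]
    if lines and lines[0].strip().isdigit():
        body = lines[1:]                 # first line is the word count
    else:
        warnings.append("missing or bad word count in first line")
        body = lines                     # do not lose the first dictionary word
    # Drop the affix flags after '/' and skip empty lines.
    words = [ln.split("/")[0] for ln in body if ln.strip()]
    return words, warnings


print(read_dic_words(["apple/S\n", "pear\n"]))
```

Without the check, the first line would be consumed as the size even when it
is a word, which matches the lost first word mentioned in the rationale.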

        * src/hunspell/affixmgr.cxx: check stripping characters and condition
          compatibility in affix rules (bugs detected in cs_CZ, es_ES, es_NEW,
          es_MX, lt_LT, nn_NO, pt_PT, ro_RO and sk_SK dictionaries). See release
          notes of Hunspell 1.0.9 in NEWS.

        * src/hunspell/affixmgr.cxx: check unnecessary fields in affix rules
          (bugs detected in ro_RO and sv_SE dictionaries). See release notes.

        * src/hunspell/affixmgr.cxx: remove redundant condition checking
          in affix rules with stripping characters (redundancy in OpenOffice.org
          dictionaries reported by Eleonóra Goldman)
          Rationale: this is a little optimization, but it was excellent for
          detecting bad ngram affixation caused by bad or weak affix conditions.

        * tests/germancompounding.aff: improve compound definition
        - use dash prefix instead of language specific tokenizer
          Rationale: Using a uniform approach is the right way to check and
          analyze compound words. Language specific word breaking is
          deprecated; word-like word pairs need sophisticated grammar checking
          (for example, in Hungarian there is a substandard, but accepted
          syntax with dash for word pairs: cats, dogs -> kutyák-macskák, like
          cats/dogs in English).

        * test Hunspell with 54 OpenOffice.org dictionaries: see release notes

bug fixes:

        * src/hunspell/suggestmgr.*: add time limit to exponential
          algorithm of the related character map suggestion
          Rationale: a long word in agglutinative languages or a special pattern
          (for example a horizontal rule) made of map characters can `crash' the
          spell checker.
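
The effect of such a time limit can be shown with a small Python sketch. It
assumes that related characters are given as groups of interchangeable
characters and that every combination is tried; the function name, the
parameters and the default limit are hypothetical and are not taken from
suggestmgr.cxx.

```python
import itertools
import time


def map_candidates(word, related, is_word, time_limit=0.1):
    """Try combinations of related characters until the time runs out.

    related: list of strings such as ["aá", "eé"]; characters in the same
    string are treated as interchangeable. is_word: dictionary lookup.
    """
    deadline = time.monotonic() + time_limit
    options = []
    for ch in word:
        group = next((g for g in related if ch in g), ch)
        options.append(group)            # a non-map character has one option
    found = []
    # The number of combinations grows exponentially with the number of
    # map characters in the word, hence the deadline check in the loop.
    for combo in itertools.product(*options):
        if time.monotonic() > deadline:
            break
        cand = "".join(combo)
        if cand != word and is_word(cand):
            found.append(cand)
    return found


dictionary = {"kutyák"}
print(map_candidates("kutyak", ["aá"], dictionary.__contains__))
# A 60 character rule made of map characters has 3**60 combinations,
# but the call still returns after about 0.05 seconds.
print(map_candidates("-" * 60, ["-_="], dictionary.__contains__, 0.05))
```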

        * src/hunspell/affentry.cxx: add() functions: fix bad word generation
          checking stripping characters (see similar bug in unmunch)

        * src/hunspell/affixmgr.cxx: parse_file(): fix unconditional getNext()
          call for ~AffixMgr() when affix file is corrupt.

        * src/hunspell/affixmgr.*: AffixMgr(), parse_cpdsyllable(): fix missing
          string duplications for ~AffixMgr() when affix file is corrupt.

        * src/hunspell/affixmgr.*: parse_affix(): fix fprintf() call when affix
          file is corrupt. Bug reported by Daniel Naber.

        * suggestmgr.cxx: replace single usage of 'strdup' with 'mystrdup'
          patch by Chris Halls (debian.org)

        * src/hunspell/makefile.mk: add makefile.mk for compiling in
          OpenOffice.org
          See README in Hunspell UNO module.
          Problems with separated compiling reported by Rene Engelhard

        * src/hunspell/hunspell.cxx: fix pseudoroot support
        - search for a non-pseudoroot homonym in check()
        * tests/pseudoroot4.*: test this fix

        * src/tools/unmunch.c: fix bad word generation when conditions
          are shorter or incompatible with stripping characters in affix rules

        * src/tools/unmunch.c: fix mychomp() for de_AT.dic and other dic files
          without last new line character.
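
The idea behind a safe chomp can be sketched in Python as follows. This is an
assumption about the nature of the fix (remove a line break only when one is
present), not a translation of the C function in unmunch.c.

```python
def mychomp(line):
    """Remove one trailing line break (LF, CRLF or CR), if there is one."""
    if line.endswith("\r\n"):
        return line[:-2]
    if line.endswith(("\n", "\r")):
        return line[:-1]
    return line                          # last line of a file without newline


print(mychomp("Zwetschke"))
```

A last line without a line break keeps its final character, so the last word
of such a dic file is not truncated.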

other changes:
        * src/hunspell/suggestmgr.*: erase ACCENT suggestion
          Rationale: ACCENT suggestion was the same as Kevin Hendricks's map
          suggestion algorithm, but with a worse interface in the affix file.

        * src/hunspell/suggestmgr.*: combine cycle number limit
          in badchar(), and forgotchar() with a time limit.

        * src/hunspell/affixmgr.*: remove NOMAPSUGS affix parameter

        * src/hunspell/{suggestmgr,hunspell}.*: strip periods from
          suggestions (restore MySpell's original behaviour)
          Rationale: OpenOffice.org has an automatic period handling mechanism
          and suggestions look better without periods.
        - new affix file parameter: SUGSWITHDOTS
          Add period(s) to suggestions, if input word terminates in period(s).
          (No need for OpenOffice.org dictionaries.)
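
The period handling described above can be sketched in Python. The function
and parameter names are hypothetical; only the behaviour (strip periods by
default, add the periods of the input word back when SUGSWITHDOTS is set)
follows the description.

```python
def format_suggestions(word, suggestions, sugs_with_dots=False):
    """Strip trailing periods from suggestions, or mirror those of the input."""
    dots = len(word) - len(word.rstrip("."))   # periods at the end of the input
    cleaned = [s.rstrip(".") for s in suggestions]
    if sugs_with_dots and dots:
        return [s + "." * dots for s in cleaned]
    return cleaned


print(format_suggestions("obvus.", ["obvious."]))
print(format_suggestions("obvus.", ["obvious"], sugs_with_dots=True))
```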

        * tests/germancompounding.aff: fix bad German affix in affix example
          (computeren->computern). Suggested by Daniel Naber.

        * src/tools/example.cxx: add Myspell's example

        * src/tools/munch.cxx: add Myspell's munch

        * man{,/hu}/hunspell.4: refresh manual pages

