Hi,
Hunspell 1.0.9 is ready to integrate with OpenOffice.org!
(https://sourceforge.net/project/showfiles.php?group_id=143754)
Main changes:
- Added OOo makefile.in and improved the UNO module
- Myspell 3.0 has a wonderful ngram suggestion algorithm implemented
by Kevin Hendricks, but it could be improved in certain cases. In
Hunspell it now works with Unicode encoding and has much better support
for languages with rich morphology (with lots of affixes).
- a lot of bug fixes and improvements (see release notes and changelog)
- successfully checked Hunspell on all OpenOffice.org dictionaries with
the improved unmunch. For dictionary developers: the check detected a
lot of bugs in OOo dictionaries (see release notes).
Bram Moolenaar, author of the excellent Vim editor, reviewed
Hunspell's code and suggested a lot of improvements
for future versions of Hunspell.
Many thanks to Kevin, Bram and other contributors!
Laci
Release notes
-------------
* improved related character map suggestion
* improved ngram suggestion
------ examples for ngram improvement (O=old, N = new ngram suggestions) --
1. Permenant (instead of Permanent)
O: Endangerment, Ferment, Fermented, Deferment's, Empowerment,
Ferment's, Ferments, Fermenting, Countermen, Weathermen
N: Permanent, Supermen, Preferment
Note: the old ngram suggestions were case sensitive.
2. permenant (instead of permanent)
O: supermen, newspapermen, empowerment, endangerment, preferments,
preferment, permanent, preferment's, permanently, impermanent
N: permanent, supermen, preferment
Note: new suggestions are also weighted with longest common subsequence,
first letter and common character positions
3. pernemant (instead of permanent)
O: pimpernel's, pimpernel, pimpernels, permanently, permanents, permanent,
supernatant, impermanent, semipermanent, impermanently
N: permanent, supernatant, pimpernel
Note: the new method also prefers root words over forms with
irrelevant affixes ('s, s and ly)
4. pernament (instead of permanent)
O: tournaments, tournament, ornaments, ornament's, ornamenting, ornamented,
ornament, ornamentals, ornamental, ornamentally
N: ornamental, ornament, tournament
Note: Both ngram methods miss here.
5. obvus (instead of obvious)
O: obvious, Corvus, obverse, obviously, Jacobus, obtuser, obtuse,
obviates, obviate, Travus
N: obvious, obtuse, obverse
Note: new method also prefers common first letters.
6. unambigus (instead of unambiguous)
O: unambiguous, unambiguity, unambiguously, ambiguously, ambiguous,
unambitious, ambiguities, ambiguousness
N: unambiguous, unambiguity, unambitious
7. consecvence (instead of consequence)
O: consecutive, consecutively, consecutiveness, nonconsecutive, consequence,
consecutiveness's, convenience's, consistences, consistence
N: consequence, consecutive, consecrates
An example in a language with rich morphology:
8. Misisipiben (instead of Mississippiben [`in Mississippi' in Hungarian]):
O: Misikédéiben, Pisisedéiben, Misikéiéiben, Pisisekéiben, Misikéiben,
Misikéidéiben, Misikékéiben, Misikéikéiben, Misikéiméiben,
Mississippiiben
N: Mississippiben, Mississippiiben, Misiiben
Note: Suggesting irrelevant affixes was the biggest fault of the old
ngram suggestion for languages with a lot of affixes.
--------------- end of examples --------------------
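The weighting mentioned in the notes above can be sketched roughly as follows. This is only an illustration, not Hunspell's code: the function names and the way the scores are combined are assumptions. Only the ingredients (case insensitivity, longest common subsequence, word lengths, first letter and common character positions) come from the notes.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def common_positions(a, b):
    """Number of positions at which a and b hold the same character."""
    return sum(1 for x, y in zip(a, b) if x == y)


def weight(bad_word, candidate):
    """Hypothetical combined score (higher is better); the formula is an assumption."""
    bad, cand = bad_word.lower(), candidate.lower()   # case insensitive
    score = 2 * lcs_length(bad, cand) - abs(len(bad) - len(cand))
    score += common_positions(bad, cand)
    if bad[:1] == cand[:1]:                           # identical first letters
        score += 1
    return score


def rank(bad_word, candidates):
    """Sort candidates so that the best weighted one comes first."""
    return sorted(candidates, key=lambda c: weight(bad_word, c), reverse=True)
```

With this toy score, `rank("Permenant", ["Endangerment", "Permanent", "Ferment"])` puts `Permanent` first, as in example 1.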
* support twofold prefix cutting
* lots of other improvements and bug fixes (see ChangeLog)
* test Hunspell with 54 OpenOffice.org dictionaries:
source:
ftp://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries
testing shell script:
-------------------------------------------------------
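# Unpack every xx_YY.zip, expand its dictionary with unmunch and
# check that Hunspell accepts each generated word.
for i in $(ls *zip | grep '^[a-z]*_[A-Z]*[.]')
do
    dic=$(basename "$i" .zip)
    mkdir "$dic"
    echo unzip "$dic"
    unzip -d "$dic" "$i" 2>/dev/null
    cd "$dic"
    echo unmunch and test "$dic"
    # -l prints only rejected words, -1 checks only the first
    # tab-delimited field, so an empty result file means success
    unmunch "$dic.dic" "$dic.aff" 2>/dev/null | awk '{print$0"\t"}' |
    hunspell -d "$dic" -l -1 >"$dic.result" 2>"$dic.err" || rm -f "$dic.result"
    cd ..
done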
--------------------------------------------------------
test result (0 size is o.k.):
$ for i in *_*/*.result; do wc -c $i; done
0 af_ZA/af_ZA.result
0 bg_BG/bg_BG.result
0 ca_ES/ca_ES.result
0 cy_GB/cy_GB.result
0 cs_CZ/cs_CZ.result
0 da_DK/da_DK.result
0 de_AT/de_AT.result
0 de_CH/de_CH.result
0 de_DE/de_DE.result
0 el_GR/el_GR.result
6 en_AU/en_AU.result
0 en_CA/en_CA.result
0 en_GB/en_GB.result
0 en_NZ/en_NZ.result
0 en_US/en_US.result
0 eo_EO/eo_EO.result
0 es_ES/es_ES.result
0 es_MX/es_MX.result
0 es_NEW/es_NEW.result
0 fo_FO/fo_FO.result
0 fr_FR/fr_FR.result
0 ga_IE/ga_IE.result
0 gd_GB/gd_GB.result
0 gl_ES/gl_ES.result
0 he_IL/he_IL.result
0 hr_HR/hr_HR.result
200694989 hu_HU/hu_HU.result
0 id_ID/id_ID.result
0 it_IT/it_IT.result
0 ku_TR/ku_TR.result
0 lt_LT/lt_LT.result
0 lv_LV/lv_LV.result
0 mg_MG/mg_MG.result
0 mi_NZ/mi_NZ.result
0 ms_MY/ms_MY.result
0 nb_NO/nb_NO.result
0 nl_NL/nl_NL.result
0 nn_NO/nn_NO.result
0 ny_MW/ny_MW.result
0 pl_PL/pl_PL.result
0 pt_BR/pt_BR.result
0 pt_PT/pt_PT.result
0 ro_RO/ro_RO.result
0 ru_RU/ru_RU.result
0 rw_RW/rw_RW.result
0 sk_SK/sk_SK.result
0 sl_SI/sl_SI.result
0 sv_SE/sv_SE.result
0 sw_KE/sw_KE.result
0 tet_ID/tet_ID.result
0 tl_PH/tl_PH.result
0 tn_ZA/tn_ZA.result
0 uk_UA/uk_UA.result
0 zu_ZA/zu_ZA.result
In the en_AU dictionary, there is an abbreviation with two dots (`eqn..'),
but `eqn.' is missing. Presumably it is a dictionary bug. Myspell did not
accept it either.
The Hungarian dictionary contains pseudoroots and forbidden words.
Unmunch does not support these features yet, so it also generates bad words.
* check affix rules and OOo dictionaries. Detected bugs in the cs_CZ,
es_ES, es_NEW, es_MX, lt_LT, nn_NO, pt_PT, ro_RO, sk_SK and sv_SE dictionaries.
Details:
--------------------------------------------------------
cs_CZ
warning - incompatible stripping characters and condition:
SFX D us ech [^ighk]os
SFX D us y [^i]os
SFX Q os ech [^ghk]es
SFX M o ech [^ghkei]a
SFX J ém ej ám
SFX J ém ejme ám
SFX J ém ejte ám
SFX A oužit up oupit
SFX A oužit upme oupit
SFX A oužit upte oupit
SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy
SFX A nout l [aeiouyáéíóúýůěr][^aeiouyáéíóúýůěrl][^aeiouy
es_ES
warning - incompatible stripping characters and condition:
SFX W umar úse [ae]husar
SFX W emir iñáis eñir
es_NEW
warning - incompatible stripping characters and condition:
SFX I unan únen unar
es_MX
warning - incompatible stripping characters and condition:
SFX A a ote e
SFX W umar úse [ae]husar
SFX W emir iñáis eñir
lt_LT
warning - incompatible stripping characters and condition:
SFX U ti siuosi tis
SFX U ti siuosi tis
SFX U ti siesi tis
SFX U ti siesi tis
SFX U ti sis tis
SFX U ti sis tis
SFX U ti simės tis
SFX U ti simės tis
SFX U ti sitės tis
SFX U ti sitės tis
nn_NO
warning - incompatible stripping characters and condition:
SFX D ar rar [^fmk]er
SFX U Øre orde ere
SFX U Øre ort ere
pt_PT
warning - incompatible stripping characters and condition:
SFX g ãos oas ão
SFX g ãos oas ão
ro_RO
warning - bad field number:
SFX L 0 le [^cg] i
SFX L 0 i [cg] i
SFX U 0 i [^i] ii
warning - incompatible stripping characters and condition:
SFX P l i l (<- there is an unnecessary tab character here)
SFX I a ii [gc] a
warning - bad field number:
SFX I a ii [gc] a
SFX I a ei [^cg] a
sk_SK
warning - incompatible stripping characters and condition:
SFX T ľať olú klať
SFX T ľať olúc klať
SFX T sľať šlú slať
SFX T sľať šlúc slať
SFX R ľcť lčiem ĺcť
SFX R iásť ätie miasť
SFX R iezť iem [^i]ezť
SFX R iezť ieš [^i]ezť
SFX R iezť ie [^i]ezť
SFX R iezť eme [^i]ezť
SFX R iezť ete [^i]ezť
SFX R iezť ú [^i]ezť
SFX R iezť úc [^i]ezť
SFX R iezť z [^i]ezť
SFX R iezť me [^i]ezť
SFX R iezť te [^i]ezť
sv_SE
warning - bad field number:
SFX C 0 net nets [^e]n
--------------------------------------------------------
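To illustrate the "incompatible stripping characters and condition" warning, here is a rough sketch of such a check for suffix rules. It is not the code in affixmgr.cxx; the function names are made up for this example, and it assumes that the condition describes the end of the word before the stripping characters are removed, so the stripping characters have to fit the tail of the condition.

```python
def parse_condition(condition):
    """Turn a condition like '[^ighk]os' into a list of (negated, chars)
    pairs, one pair per character position."""
    positions = []
    i = 0
    while i < len(condition):
        ch = condition[i]
        if ch == "[":
            end = condition.index("]", i)       # raises ValueError if unclosed
            body = condition[i + 1:end]
            negated = body.startswith("^")
            positions.append((negated, set(body[1:] if negated else body)))
            i = end + 1
        elif ch == ".":
            positions.append((True, set()))     # negated empty set: any character
            i += 1
        else:
            positions.append((False, {ch}))
            i += 1
    return positions


def suffix_rule_is_compatible(strip, condition):
    """Return False if the stripping characters can never match a word
    ending that the condition accepts."""
    if strip == "0":                            # "0" means nothing is stripped
        return True
    positions = parse_condition(condition)
    # Compare from the end of the word, over the overlapping part only.
    for ch, (negated, chars) in zip(reversed(strip), reversed(positions)):
        if (ch in chars) == negated:            # position rejects this character
            return False
    return True
```

For the first cs_CZ rule above, `suffix_rule_is_compatible("us", "[^ighk]os")` returns `False`: a word accepted by the condition ends in `os`, so `us` can never be stripped from it.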
ChangeLog
---------
improvements:
* src/hunspell/suggestmgr.cxx:
Unicode support in related character map suggestion
* src/hunspell/suggestmgr.cxx: Unicode support in ngram suggestion
* src/hunspell/{suggestmgr,affixmgr,hunspell}.cxx: improve ngram
suggestion.
Fix http://qa.openoffice.org/issues/show_bug.cgi?id=35725. See release
notes for examples. This problem was reported by beccablain at
OpenOffice.org.
- ngram suggestions are now case insensitive (see `Permenant' bug in
Issuezilla)
- weight ngram suggestions (with the longest common subsequence
algorithm, also considering lengths of bad word and suggestion,
identical first letters and almost completely identical character
positions)
- set strict affix congruency in expand_rootword(). Now ngram
suggestions are good for languages with rich morphology and also
better for English.
Rationale: affixed forms of the first ngram suggestion
very often suppress the second and subsequent root word suggestions.
But faults in affixes are less common, and can be fixed without
suggestions. We must prefer the more informative second and
subsequent root word suggestions to the suggestions for bad affixes.
- a better suggestion may not be a substring of a less good suggestion
Rationale: Suggesting affixed forms of a root word is
unnecessary when the root word has a better weighted ngram value.
(Checking substrings is a good approximation for this refinement.)
- fewer ngram suggestions (default maximum is 3 instead of 10)
Rationale: Users need a big extra effort to check a lot of bad
ngram suggestions, nine times out of ten unnecessarily. It is very
distracting, because ngram suggestions can be very different.
Myspell and Hunspell usually give one or two suggestions with
the old suggestion algorithms (maximum is 15), but the ngram algorithm
often gives the maximum number of suggestions. With strict affix
congruency and other refinements, the good suggestion is usually
among the first three elements.
- new affix parameter: MAXNGRAMSUG
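The substring rule and the new maximum can be pictured with the following sketch. It is an approximation written for this note, not the Hunspell source: `filter_suggestions` and `max_ngram_sug` are hypothetical names, and the input list is assumed to be sorted already, best suggestion first.

```python
def filter_suggestions(ranked, max_ngram_sug=3):
    """Keep at most max_ngram_sug suggestions from a best-first list,
    skipping any word that contains a better suggestion as a substring."""
    kept = []
    for cand in ranked:
        if any(better in cand for better in kept):
            continue          # e.g. "permanently" after "permanent"
        kept.append(cand)
        if len(kept) == max_ngram_sug:
            break
    return kept
```

For example, with an input order made up from the words of example 3 in the release notes, `filter_suggestions(["permanent", "permanently", "impermanent", "supernatant", "pimpernel", "pimpernels"])` keeps `permanent`, `supernatant` and `pimpernel`.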
* src/hunspell/*: support agglutinative languages with rich prefix
morphology or with right-to-left writing system (for example, Turkic
and Austronesian languages with (modified) Arabic scripts).
- new affix parameter: COMPLEXPREFIXES
Set twofold prefix stripping (but single suffix stripping)
* src/hunspell/affixmgr.cxx:
- speed up prefix loading with tree sorting algorithm.
* tests/complexprefixes.*, tests/complexprefixesutf.*:
Coptic example posted by Moheb Mekhaiel
* src/hunspell/hashmgr.cxx: check size attribute in dic file
suggested by Daniel Naber
Rationale: With a missing size attribute Hunspell allocates a hash
table that is too small and slower, and Hunspell can lose the first
dictionary word.
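A dictionary developer can do a similar check before shipping a dic file. The sketch below assumes only that the first line of a dic file should start with the (approximate) word count; the function name is hypothetical and the code is not taken from hashmgr.cxx.

```python
def read_dic_size(first_line):
    """Return the word count declared on the first line of a dic file,
    or None if the line does not start with a plain decimal number."""
    fields = first_line.split()
    if fields and fields[0].isascii() and fields[0].isdigit():
        return int(fields[0])
    return None   # probably a dictionary word: the size line is missing
```

If this returns `None` for the first line, that line is most likely a word, and it is the word that would be lost.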
* src/hunspell/affixmgr.cxx: check stripping characters and condition
compatibility in affix rules (bugs detected in cs_CZ, es_ES, es_NEW,
es_MX, lt_LT, nn_NO, pt_PT, ro_RO and sk_SK dictionaries). See release
notes of Hunspell 1.0.9 in NEWS.
* src/hunspell/affixmgr.cxx: check unnecessary fields in affix rules
(bugs detected in ro_RO and sv_SE dictionaries). See release notes.
* src/hunspell/affixmgr.cxx: remove redundant condition checking
in affix rules with stripping characters (redundancy in OpenOffice.org
dictionaries reported by Eleonóra Goldman)
Rationale: this is a little optimization, but it was excellent for
detecting bad ngram affixation caused by bad or weak affix conditions.
* tests/germancompounding.aff: improve compound definition
- use dash prefix instead of language specific tokenizer
Rationale: Using a uniform approach is the right way to check and
analyze compound words. Language-specific word breaking is deprecated;
word-like word pairs need sophisticated grammar checking
(for example, in Hungarian there is a substandard, but accepted
syntax with dash for word pairs: dogs, cats -> kutyák-macskák, like
dogs/cats in English).
* test Hunspell with 54 OpenOffice.org dictionaries: see release notes
bug fixes:
* src/hunspell/suggestmgr.*: add time limit to exponential
algorithm of the related character map suggestion
Rationale: a long word in agglutinative languages or a special pattern
(for example a horizontal rule) made of map characters can `crash' the
spell checker.
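Why a time limit helps can be seen in a small sketch. The code below is not the suggestmgr implementation: `map_variants`, its parameters and the 0.1 second default are assumptions for illustration. It tries every combination of related characters, so the work grows exponentially with the number of map characters in the word, and it gives up when the deadline passes.

```python
import time


def map_variants(word, related, time_limit=0.1):
    """Collect variants of word built by swapping related characters.

    related is a list of strings such as ["aá", "eé"]; each string is a
    group of interchangeable characters. Returns (variants, finished),
    where finished is False if the time limit stopped the search.
    """
    groups = {ch: group for group in related for ch in group}
    deadline = time.monotonic() + time_limit
    results = []

    def expand(prefix, rest):
        if time.monotonic() > deadline:
            return False                      # out of time: abandon the search
        if not rest:
            results.append(prefix)
            return True
        # A character without a group only "maps" to itself.
        for ch in groups.get(rest[0], rest[0]):
            if not expand(prefix + ch, rest[1:]):
                return False
        return True

    finished = expand("", word)
    return results, finished
```

A horizontal rule of 40 map characters with three alternatives each would need 3^40 variants; with the limit the call simply returns early with `finished` set to `False`.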
* src/hunspell/affentry.cxx: add() functions: fix bad word generation
by checking stripping characters (see similar bug in unmunch)
* src/hunspell/affixmgr.cxx: parse_file(): fix unconditional getNext()
call for ~AffixMgr() when affix file is corrupt.
* src/hunspell/affixmgr.*: AffixMgr(), parse_cpdsyllable(): fix missing
string duplications for ~AffixMgr() when affix file is corrupt.
* src/hunspell/affixmgr.*: parse_affix(): fix fprintf() call when affix
file is corrupt. Bug reported by Daniel Naber.
* suggestmgr.cxx: replace single usage of 'strdup' with 'mystrdup'
patch by Chris Halls (debian.org)
* src/hunspell/makefile.mk: add makefile.mk for compiling in
OpenOffice.org
See README in the Hunspell UNO module.
Problems with separated compiling reported by Rene Engelhard
* src/hunspell/hunspell.cxx: fix pseudoroot support
- search for a non-pseudoroot homonym in check()
* tests/pseudoroot4.*: test this fix
* src/tools/unmunch.c: fix bad word generation when conditions
are shorter or incompatible with stripping characters in affix rules
* src/tools/unmunch.c: fix mychomp() for de_AT.dic and other dic files
without a final newline character.
other changes:
* src/hunspell/suggestmgr.*: erase ACCENT suggestion
Rationale: ACCENT suggestion was the same as Kevin Hendricks's map
suggestion algorithm, but with a less good interface in the affix file.
* src/hunspell/suggestmgr.*: combine the cycle number limit
in badchar() and forgotchar() with a time limit.
* src/hunspell/affixmgr.*: remove NOMAPSUGS affix parameter
* src/hunspell/{suggestmgr,hunspell}.*: strip periods from
suggestions (restore MySpell's original behaviour)
Rationale: OpenOffice.org has an automatic period handling mechanism
and suggestions look better without periods.
- new affix file parameter: SUGSWITHDOTS
Add period(s) to suggestions, if input word terminates in period(s).
(Not needed for OpenOffice.org dictionaries.)
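The effect of SUGSWITHDOTS can be pictured as follows. This is only a sketch of the behaviour described above, with a hypothetical function name; it assumes the suggestions were computed for the input word without its trailing period(s).

```python
def suggestions_for_output(word, suggestions, sugs_with_dots=False):
    """Return suggestions as they would be shown for the input word.

    Without SUGSWITHDOTS the suggestions are returned without periods;
    with it, the trailing period(s) of the input word are added back.
    """
    if not sugs_with_dots:
        return list(suggestions)
    dots = len(word) - len(word.rstrip("."))   # number of trailing periods
    return [s + "." * dots for s in suggestions]
```

So for the input `obvus.` the suggestion `obvious` stays `obvious` by default and becomes `obvious.` when the parameter is set.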
* tests/germancompounding.aff: improve bad German affix in affix example
(computeren->computern). Suggested by Daniel Naber.
* src/tools/example.cxx: add Myspell's example
* src/tools/munch.cxx: add Myspell's munch
* man{,/hu}/hunspell.4: refresh manual pages