Hi,

The most time-consuming suggestion algorithms in Hunspell are the MAP and
n-gram suggestions. MAP is a character permutation algorithm, and for a
single misspelled French word with 8 vowels (using the MAP definition of
your affix file) it can check ~4^8 = 65 thousand possible suggestions.
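
To make the 4^8 figure concrete, here is a minimal Python sketch of the
combinatorics. It is not Hunspell's implementation: the character groups
below are hypothetical stand-ins for a MAP definition (four related forms
per vowel), and the function names are invented for illustration.

```python
from itertools import product

# Hypothetical MAP-style groups (an assumption, not the real French affix file):
# every character in a group may be replaced by any other member of the group.
GROUPS = ["aàâä", "eéèê", "iîïí", "oôöó", "uùûü"]
RELATED = {ch: group for group in GROUPS for ch in group}


def candidate_count(word, related=RELATED):
    """Number of variants produced by substituting related characters."""
    count = 1
    for ch in word:
        # Characters without a group contribute exactly one choice (themselves).
        count *= len(related.get(ch, ch))
    return count


def map_candidates(word, related=RELATED):
    """Lazily generate every variant of `word` (the word itself included)."""
    choices = [related.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)
```

With these assumed groups, a word containing 8 vowels gives
candidate_count(...) == 4 ** 8 == 65536 variants, each of which would have
to be looked up in the dictionary.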

The MAP algorithm has a time limit (~1 sec.), which is acceptable in word
processing. But the running time of the n-gram based suggestion algorithm
depends only on the dictionary size. With more than 100 thousand dictionary
words, n-gram suggestion for a long word can take a long time.
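
The dependence on dictionary size can be illustrated with a simplified
sketch. This is not Hunspell's actual scoring: the trigram overlap measure
and the function names are assumptions chosen only to show that every
dictionary entry is compared with the misspelled word once.

```python
def ngrams(word, n=3):
    """Set of character n-grams of `word`, padded with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_score(a, b, n=3):
    """Simplified similarity: number of n-grams the two words share."""
    return len(ngrams(a, n) & ngrams(b, n))


def ngram_suggest(misspelled, dictionary, limit=3):
    """Score every dictionary word; the cost grows with len(dictionary)."""
    scored = [(ngram_score(misspelled, word), word) for word in dictionary]
    # Best score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:limit] if score > 0]
```

Because the whole word list is scanned for each misspelling, doubling the
number of entries roughly doubles the work in this sketch.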

Hunspell with the 90 thousand words of the recent French dictionary is not
too slow for a single suggestion, or for spell checking a long document
without suggestions. For other tasks (automatic spell checking of long texts
with suggestions), remove or limit the MAP definition, and use

MAXNGRAMSUGS 0

in the affix file to disable n-gram suggestions.

Another option is to use affixes to compress a large dictionary (~200-300
thousand words). There is a new tool in the Hunspell distribution for
automatic affix compression, "affixcompress". A Mongolian word list with 2.7
million words has been compressed to 77 thousand words by affixcompress:
http://www.openoffice.org/issues/show_bug.cgi?id=92263

(Note: affixcompress is not the best tool for an agglutinative language
like Mongolian, but I hope future versions will be able to detect the
morphology and classify the words of a huge corpus. For now, the output of
affixcompress can help to detect real stems and frequent suffixes in the
words of a text corpus.)

It is also useful to split a large dictionary into a base part (~100 thousand
words) and extra dictionaries. The Hunspell library and the standalone
Hunspell already support extra dictionaries (see the Hunspell manual). I hope
OpenOffice.org will also support this feature in the near future.

Best regards,
László

2008/8/18 Thomas Lange - Sun Germany - ham02 - Hamburg <[EMAIL PROTECTED]>

>
> Hi,
>
> I don't know the algorithm used in hunspell.
> Thus how about other dictionaries?
> Maybe the large number of entries in the affix file is exactly the
> reason why it can be so fast with Hungarian words...
>
> Thomas
>
