Christian Lohmaier wrote:

> This could also mean that these are just "dumb" wordlists that don't
> make use of affix transformations. Not really suitable for comparing
> quality then, even when the languages are closely related.

This is not the case, though. Swedish, Danish and Norwegian are closely related and have the same language structure. An expanded wordlist is 5 to 6 times longer than a well-compressed one using the right ispell flags. That factor is smaller for English and German and a lot larger for Finnish and Hungarian.

The current German dictionary maintained by Björn Jacke has 80,000 basic forms which expand to 300,000 variations, for a factor of 3.75. Swedish, Danish and Norwegian form basic words (with compounds) the same way German does. Basic words can often be translated syllable by syllable, so the number of basic forms should be about the same. But the Scandinavian languages use endings instead of the definite article (the/der/die/das), resulting in a larger number of expanded variations.

The current da_DK.dic has 108,400 basic forms and expands to 380,199 variations. The two versions of Norwegian have 133,242 (nb_NO) and 102,578 (nn_NO) basic forms, respectively, and expand to 556,600 and 295,306 variations. However, the currently used Swedish dictionary (which is from 2003, but almost unchanged since 1997) has 24,489 basic forms and expands to 118,270 variations. This is clearly inferior.

Of course, if the Swedish dictionary contained 24,000 relevant words and the other languages had many highly specialized words which are only rarely used, we'd still stand a chance. However, this is not the case either.

Fortunately, my friend who maintains the Swedish dictionary has recently published a new version (DSSO 1.22) that expands to 242,611 variations, so he's making great progress. I hope this will be included in future versions of OpenOffice.org. We're catching up with the Danes and Norwegians, but they are still ahead.

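For reference, the expansion factors implied by the counts above can be recomputed with a short Python sketch. The names `COUNTS` and `expansion_factor` are my own, hypothetical ones and not part of ispell or hunspell; the numbers are only the ones quoted in this message. DSSO 1.22 is left out because only its expanded count is given.

```python
# Counts quoted in this message: (basic forms, expanded variations).
# The labels are just the names used in the text above.
COUNTS = {
    "German": (80_000, 300_000),
    "da_DK": (108_400, 380_199),
    "nb_NO": (133_242, 556_600),
    "nn_NO": (102_578, 295_306),
    "Swedish (2003)": (24_489, 118_270),
}


def expansion_factor(basic_forms, variations):
    """Average number of expanded variations per basic form."""
    if basic_forms <= 0:
        raise ValueError("basic_forms must be positive")
    return variations / basic_forms


if __name__ == "__main__":
    for name, (basic, expanded) in COUNTS.items():
        factor = expansion_factor(basic, expanded)
        print(f"{name:15s} {basic:>8,d} -> {expanded:>8,d}  factor {factor:.2f}")
```

The factor only describes how compactly a list is stored with affix flags; it says nothing by itself about how many of the expanded words are actually useful.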
Yesterday I found this paper by two Hungarian authors, who discuss Zipf's law and the minimum number of words a dictionary needs in order to cover some percentage of a given corpus of text:

http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf

Their most important observation is that a decent spelling dictionary needs to contain 20,000 words (variations) for English and 80,000 words for German, but 400,000 for Hungarian. The right number for Scandinavian languages should thus be somewhere between 80,000 and 400,000.

However, that is only counting the most frequent words of a language. When I add "home" to an ispell/hunspell dictionary, I also add "homes" and "homely" because of how the flags work, even though "homely" isn't necessarily among the very common and relevant words. So I add a lot of less relevant words, which don't contribute much to the dictionary's usefulness. When I add one basic word and thus 5 to 6 variations (for Swedish), perhaps I only add 2 or 3 useful variations. It is hard to know just how much the numbers are inflated.

> I don't think there is a way to measure this at all. You "feel" that it
> is good or bad, but you cannot really measure it.
> You can give examples, but that's about it. (IMHO)

In the case of OpenOffice.org, what really matters is how people "feel" it compares to Microsoft Word's spell checker. If that one were really useless, we wouldn't have to bother. But now we have to bother.

-- 
Lars Aronsson ([EMAIL PROTECTED])
Aronsson Datateknik - http://aronsson.se