Christian Lohmaier wrote:

> This could also mean that these are just "dumb" wordlists that don't
> make use of affix transformations. Not really suitable for comparing
> quality then, even when the languages are closely related.

This is not the case, though.  Swedish, Danish and Norwegian are 
closely related and have the same language structure.  An expanded 
wordlist is 5 to 6 times longer than a well-compressed one that 
uses the right ispell flags.  That factor is smaller for English 
and German and a lot larger for Finnish and Hungarian.

The current German dictionary maintained by Björn Jacke has 80,000 
basic forms which expand to 300,000 variations, for a factor of 
3.75.  Swedish/Danish/Norwegian form basic words (with compounds) 
the same way as German does.  Basic words can often be 
translated syllable by syllable, so the number of basic forms 
should be about the same. But the Scandinavian languages use 
endings instead of the definite article (the/der/die/das), 
resulting in a larger number of expanded variations.

The current da_DK.dic has 108,400 basic forms and expands to 
380,199 variations.  The two versions of Norwegian have 133,242 
(nb_NO) and 102,578 (nn_NO) basic forms, respectively, and expand 
to 556,600 and 295,306 variations. However, the currently used 
Swedish dictionary (which is from 2003, but almost unchanged since 
1997) has 24,489 basic forms and expands to 118,270 variations.  
This is clearly inferior.
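
To make the comparison concrete, here is a minimal sketch that computes 
the expansion factor (expanded variations divided by basic forms) for 
the figures quoted above.  The dictionary labels "de_DE" and 
"sv_SE (2003)" and the function names are my own choices for this 
illustration and are not taken from any dictionary package.

```python
# Basic forms and expanded variations, as quoted in this message.
# The labels "de_DE" and "sv_SE (2003)" are illustrative names only.
DICTIONARIES = {
    "de_DE": (80_000, 300_000),
    "da_DK": (108_400, 380_199),
    "nb_NO": (133_242, 556_600),
    "nn_NO": (102_578, 295_306),
    "sv_SE (2003)": (24_489, 118_270),
}


def expansion_factor(basic, expanded):
    """Return how many expanded variations exist per basic form."""
    return expanded / basic


def report(dictionaries):
    """Return (name, basic, expanded, factor) rows, factor rounded to 2 places."""
    return [
        (name, basic, expanded, round(expansion_factor(basic, expanded), 2))
        for name, (basic, expanded) in dictionaries.items()
    ]


if __name__ == "__main__":
    for name, basic, expanded, factor in report(DICTIONARIES):
        print(f"{name:14s} {basic:>8,d} {expanded:>8,d} {factor:5.2f}")
```

The factor alone says little about quality: the old Swedish list has a 
high factor but far fewer basic forms than the others.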

Of course, if the Swedish dictionary contained 24,000 relevant 
words and the other languages had many highly specialized words 
which are only rarely used, we'd still stand a chance.  However, 
this is not the case either.

Fortunately, my friend who maintains the Swedish dictionary has 
recently published a new version (DSSO 1.22) that expands to 
242,611 variations, so he's making great progress.  I hope this 
will be included in future versions of OpenOffice.org.  We're 
catching up with the Danes and Norwegians, but they are still ahead.

Yesterday I found this paper by two Hungarian authors, who discuss 
Zipf's law and the minimum number of words in a dictionary 
required to cover some percentage of a given corpus of text:
http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf

Their most important observation is that a decent spelling 
dictionary needs to contain 20,000 words (variations) for English 
and 80,000 words for German, but 400,000 for Hungarian.  The right 
number for Scandinavian languages should thus be somewhere between 
80,000 and 400,000.
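
The kind of measurement involved can be sketched as follows: count word 
forms in a corpus, sort them by frequency, and see how many of the most 
frequent forms are needed to cover a given share of the running text.  
This is only my own rough illustration, not the method used in the 
paper; the function name and the whitespace tokenization in the usage 
example are assumptions.

```python
from collections import Counter


def words_needed_for_coverage(tokens, coverage):
    """Return the smallest number of distinct word forms, taken in order
    of decreasing frequency, whose occurrences make up at least
    `coverage` (a fraction in (0, 1]) of all tokens."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in the interval (0, 1]")
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= coverage * total:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Toy corpus; a real measurement needs millions of running words.
    corpus = "the the the the cat cat sat on the mat".split()
    for share in (0.5, 0.75, 1.0):
        print(share, words_needed_for_coverage(corpus, share))
```

On a real corpus the count grows very quickly as the coverage target 
approaches 100%, which is the Zipf's law effect the paper discusses.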

However, that is only counting the most frequent words from a 
language.  When I add "home" to an ispell/hunspell dictionary, I 
also add "homes" and "homely" because of how the flags work, even 
though "homely" isn't necessarily among the very common and 
relevant words.  So I add a lot of less relevant words, which 
don't contribute much to the dictionary's usefulness.  When I add 
one basic word and thus 5 to 6 variations (for Swedish), perhaps I 
only add 2 to 3 useful variations.  It is hard to know just how much 
the numbers are inflated.
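
The inflation effect can be illustrated with a toy model.  The two 
suffix rules below ("S" appends "s", "Y" appends "ly") are hypothetical 
and far simpler than the rules in a real ispell or hunspell affix file; 
the set of "relevant" words is likewise made up for the example.

```python
# Hypothetical suffix rules keyed by flag letter.  Real affix files
# also carry conditions and stripping rules, which are omitted here.
TOY_SUFFIX_RULES = {"S": "s", "Y": "ly"}


def expand(entry, rules=TOY_SUFFIX_RULES):
    """Expand an entry of the form 'stem/FLAGS' into the stem plus one
    suffixed form per flag."""
    stem, _, flags = entry.partition("/")
    return [stem] + [stem + rules[flag] for flag in flags]


def useful_share(entries, relevant):
    """Return (number of expanded forms, number of those in `relevant`)."""
    expanded = [form for entry in entries for form in expand(entry)]
    useful = [form for form in expanded if form in relevant]
    return len(expanded), len(useful)


if __name__ == "__main__":
    print(expand("home/SY"))
    # Assume only "home" and "homes" are among the common words.
    print(useful_share(["home/SY"], {"home", "homes"}))
```

Here one entry yields three forms, of which only two are counted as 
useful, so the raw count of variations overstates the coverage.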

> I don't think there is a way to measure this at all. You "feel" that it
> is good or bad, but you cannot really measure it.
> You can give examples, but that's about it. (IMHO)

In the case of OpenOffice.org, what really matters is what people 
"feel" about Microsoft Word's spell checker.  If that were really 
useless, we wouldn't have to bother.  But now we have to bother.


-- 
  Lars Aronsson ([EMAIL PROTECTED])
  Aronsson Datateknik - http://aronsson.se
