Hi Lars,

On Fri, Dec 22, 2006 at 07:24:50PM +0100, Lars Aronsson wrote:
> Christian Lohmaier wrote:
>
> > This could also mean that these are just "dumb" wordlists that don't
> > make use of affix transformations. Not really suitable for comparing
> > quality then, even when the languages are closely related.
>
> This is not the case, though.
Well, I cannot judge...

> Swedish, Danish and Norwegian are closely related and have the same
> language structure. An expanded wordlist is 5...6 times longer than a
> well-compressed one using the right ispell flags.

When you say "comparing the expanded wordlists", then this is more of an
objective measure than just comparing the byte size of the dictionary
files.

> That factor is smaller for English and German and a lot larger for
> Finnish and Hungarian.

Might be, but that's not really my point. You might be able to compare
unmunched wordlists, but you cannot judge the "quality" from this data.
The initial point was to compare the dictionary files, and that is just
not a good way to compare them. The additional data you give now allows
one to estimate the unmunched size, and thus to compare the number of
words included, but this still doesn't allow a judgement of quality.

> The current German dictionary maintained by Björn Jacke has 80,000
> basic forms which expand to 300,000 variations, for a factor of
> 3.75. Swedish/Danish/Norwegian have the same way to form basic
> words (with compounds) as German. Basic words can often be
> translated syllable by syllable, so the number of basic forms
> should be about the same.

"Should"... And as you say yourself: not all affix possibilities are
used. So while some of the features are used, others are not. What you
really want to compare is the expanded wordlist...

> But the Scandinavian languages use
> endings instead of the definite article (the/der/die/das),
> resulting in a larger number of expanded variations.
>
> The current da_DK.dic has 108,400 basic forms and expands to
> 380,199 variations. The two versions of Norwegian have 133,242
> (nb_NO) and 102,578 (nn_NO) basic forms, respectively, and expand
> to 556,600 and 295,306 variations. However, the currently used
> Swedish dictionary (which is from 2003, but almost unchanged since
> 1997) has 24,489 basic forms and expands to 118,270 variations.
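Just to illustrate the base-form vs. expansion distinction we are
talking about: here is a minimal sketch, in Python rather than real
ispell code, of how affix flags blow up a compressed wordlist. The flag
names and suffix rules below are invented for illustration; real .aff
files also support prefixes, stripping and conditions.

```python
# Hypothetical affix table: each flag names a set of suffixes.
# (Invented for illustration -- not real ispell/.aff syntax.)
AFFIX_RULES = {
    "S": ["s"],      # hypothetical plural flag
    "D": ["ed"],     # hypothetical past-tense flag
    "G": ["ing"],    # hypothetical progressive flag
}

def expand(base, flags):
    """Return the base form plus every suffixed variation."""
    words = {base}
    for flag in flags:
        for suffix in AFFIX_RULES.get(flag, []):
            words.add(base + suffix)
    return words

# A 3-entry "compressed" wordlist: (base form, flags).
wordlist = [("walk", "SDG"), ("talk", "SDG"), ("cat", "S")]

expanded = set()
for base, flags in wordlist:
    expanded |= expand(base, flags)

print(len(wordlist), "base forms ->", len(expanded), "variations")
```

The expansion factor (here 10/3) depends entirely on how aggressively
the flags are used, which is exactly why raw file sizes of .dic files
tell you so little.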
> This is clearly inferior.

Here you see another problem. If a language has lots of variations of a
single word, how can you tell that 12,000 of the expanded words aren't
based on "useless" words (words not in widespread use, hiding
typos, ...)? Or, the other way round: you cannot tell that the important
ones are present. So numbers like "that dictionary knows 300,000 words,
the other one only contains 230,000" don't allow you to say that the one
with 300,000 words is the one with the "better" quality. It still
depends on subjective things, the "feel" of the dictionary. And
different groups of users (who use different writing styles) might well
give you different answers on which of the two dictionaries is the
"better" one.

> Of course, if the Swedish dictionary contained 24,000 relevant
> words and the other languages had many highly specialized words
> which are only rarely used, we'd still stand a chance. However,
> this is not the case either.

Well, just by looking at those numbers you cannot tell. You're already
applying the subjective judgement here :-)

> [...]
> Yesterday I found this paper by two Hungarian authors, who discuss
> Zipf's law and the minimum number of words in a dictionary
> required to cover some percentage of a given corpus of text,
> http://www.nslij-genetics.org/wli/zipf/nemeth02.pdf

Having the right number doesn't help if those numbers are made up of the
wrong words.

> Their most important observation is that a decent spelling
> dictionary needs to contain 20,000 words (variations) for English
> and 80,000 words for German, but 400,000 for Hungarian. The right
> number for Scandinavian languages should thus be somewhere between
> 80,000 and 400,000.

Sure, this is a rough estimate. But getting to know those "important"
words is what really counts. You could compare the quality by checking
how each dictionary handles these important words, but for that you'd
need to have the list.
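To make that point concrete, here is a toy sketch of such a comparison.
All the word lists below are invented for illustration; the point is
only that a much bigger dictionary can score worse against a list of
"important" words than a smaller one.

```python
# Hypothetical list of "important" words a dictionary must accept.
important = {"house", "houses", "run", "running", "bread"}

# dict_a is larger, but padded with rare/useless expansions;
# dict_b is smaller, but covers the important words completely.
# (Both sets are invented for illustration.)
dict_a = {"house", "run", "bread"} | {f"rareword{i}" for i in range(100)}
dict_b = {"house", "houses", "run", "running", "bread"}

def coverage(dictionary, wanted):
    """Fraction of the wanted words the dictionary would accept."""
    return len(dictionary & wanted) / len(wanted)

print(len(dict_a), "words, coverage:", coverage(dict_a, important))
print(len(dict_b), "words, coverage:", coverage(dict_b, important))
```

Raw word counts put dict_a far ahead; coverage of the words that
actually matter puts dict_b ahead. Which is the whole problem: without
the "important" list, the counts alone decide nothing.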
And if you have that list, you can just create a dictionary from it,...
So what really is left is the subjective judgement (apart from:
"obviously too few words").

> [...]
>
> > I don't think there is a way to measure this at all. You "feel" that
> > it is good or bad, but you cannot really measure it.
> > You can give examples, but that's about it. (IMHO)
>
> In the case of OpenOffice.org, what really matters is what people
> "feel" about Microsoft Word's spell checker. If that was really
> useless, we wouldn't have to bother. But now we have to bother.

I didn't mean to say that the OOo dictionaries shouldn't be improved,
only that you cannot judge them from any set of numbers.

If you want to compare them to Word's spell-checker, you'd have to
harvest thousands of documents in the desired language, run them through
Word's spell-check and OOo's spell-check, and compare the results. A
statistical step can help a human who looks at the misses/differences to
decide which words the dictionary is lacking and how well or badly it
compares to other languages (always: compared to Word's spell-checker
only). But that's not an easy task at all.

ciao
Christian
--
NP: Metallica - The House Jack Built
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]