Re: Hunspell dictionaries are not just words lists (+ other matters)

Olivier R. Tue, 08 Nov 2011 02:13:07 -0800

Hello all,

On 08/11/2011 01:58, Rob Weir wrote:

Spell checking dictionaries are just compilations of facts that are
constrained by the preexisting external facts of the language.  The
compiler of the dictionary does not create these facts.  He merely
encodes them.  The particular dictionary might be copyrightable as a
specific selection, coordination and arrangement of these facts, but
fair use would allow me to extract the same facts from the
dictionaries, via reverse engineering, and make my own selection,
coordination and arrangement of these same facts and distribute them
as my own dictionary.  In other words, you might be able to protect
the compilation of facts, but you cannot protect the underlying facts,
or prevent people from copying your encoding of these facts and
distributing a different arrangement of them.  Copyright protection on
a compilation of facts is extremely thin.  It is that simple.

I am no expert on legal matters, and I think you might get different legal answers in different countries.


So I’ll try to stay on technical ground.

Let’s assume that someone wants to create a Hunspell dictionary from scratch. He finds a huge lexicon of well-organized information about his language: a proper list of words with morphological data, tags, etc. Let’s assume this is just a compilation of facts.

(Actually, even calling this lexicon a mere compilation of facts is arguable, because it can also contain a lot of specific classification, personal tags, interpretation data, etc. Otherwise, we wouldn’t have had so many arguments when we tagged the French dictionary. But let’s )


Would this list _tell_ him how to create an affixation file? No.
Would this list _help_ him create an affixation file? No.
Is there just one way to create an affixation file from this list? No.

Actually, even if I had had such a lexicon of all the facts about the French language when I began work on the affixation file, it would have required as much time, as much reflection, and as many personal choices.

Creating an affixation file is on a higher level than just collecting data. It’s not a way of classifying, tagging or selecting data.

So, what is an affixation file? It is a description of a compression algorithm: a description of a human-understandable logic for factorizing the data of a specific language.

The lexicon could have been compressed with zip, rar, 7z or whatever algorithm. In the same way, there are many ways to factorize a lexicon with a human-understandable logic.
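To make the "many ways to factorize" point concrete, here is a minimal Python sketch. It is a toy model only: the `expand` function, the single-letter flags and the (strip, add) rule format are assumptions made for illustration, not Hunspell's actual file format or algorithm. It shows two different rule sets that generate exactly the same eight French word forms, one with more rules and fewer entries, the other with fewer rules and more entries.

```python
# Toy model of affix-based factorization (hypothetical; NOT the real
# Hunspell .aff/.dic format). Each entry is (stem, flags); each flag maps
# to a list of (strip, add) suffix rules.

def expand(entries, rules):
    """Return the set of all word forms generated by the entries."""
    forms = set()
    for stem, flags in entries:
        forms.add(stem)  # the stem itself is always a valid form
        for flag in flags:
            for strip, add in rules[flag]:
                if strip and not stem.endswith(strip):
                    continue  # rule does not apply to this stem
                base = stem[: len(stem) - len(strip)]
                forms.add(base + add)
    return forms

# Factorization A: 2 entries, 3 rules under one flag.
rules_a = {"A": [("", "e"), ("", "s"), ("", "es")]}
entries_a = [("grand", "A"), ("petit", "A")]

# Factorization B: 4 entries, 1 rule (plural only); the feminine forms
# are listed as separate entries instead of being generated.
rules_b = {"P": [("", "s")]}
entries_b = [("grand", "P"), ("grande", "P"), ("petit", "P"), ("petite", "P")]

forms_a = expand(entries_a, rules_a)
forms_b = expand(entries_b, rules_b)
print(sorted(forms_a))
print(forms_a == forms_b)
```

Both factorizations describe the same data set; which one to prefer is a design decision made by the person writing the rules, not something the word list dictates.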

When I created the French affixation file, one already existed, but I was really not satisfied with it, so I rewrote it. With the previous French dictionary, there were approximately 600 rules in the affixation file and 92,000 entries in the word list. After one year of work on the new affixation file, there were approximately 12,000 rules and 60,000 entries, but this new dictionary generates more inflections than the previous one, and also far fewer mistakes (because affixation files can have a lot of side effects and can generate a lot of wrong inflections).
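The side effects mentioned above can be sketched in a few lines of Python. This is a hedged illustration, not the real French affixation file: `apply_rules`, the rule tuples and the regular-expression conditions are hypothetical, and the "careful" conditions are deliberately crude (real French has exceptions such as "bal" -> "bals"). The idea is that a rule that is too general produces wrong inflections, and that avoiding them takes more, narrower rules.

```python
import re

# Each rule is (strip, add, condition): if the stem's ending matches the
# condition (a regular expression), remove `strip` and append `add`.
# Hypothetical toy format, loosely inspired by conditional affix rules.

def apply_rules(stem, rules):
    """Return the stem plus every form the rules generate from it."""
    forms = {stem}
    for strip, add, condition in rules:
        if re.search(condition + "$", stem):
            base = stem[: len(stem) - len(strip)]
            forms.add(base + add)
    return forms

# One naive rule: "add -s to anything".
naive = [("", "s", ".")]

# Two narrower rules: no -s after l/s/x/z, and -al becomes -aux.
careful = [
    ("", "s", "[^lsxz]"),
    ("al", "aux", "al"),
]

for word in ("chat", "nez", "cheval"):
    print(word, sorted(apply_rules(word, naive)), sorted(apply_rules(word, careful)))
```

With the naive rule, "nez" and "cheval" yield the wrong forms "nezs" and "chevals"; with the narrower rules, "nez" stays invariable and "cheval" yields "chevaux".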

Even now, the compression method could be really different from what it is, but the data set would be the same. And, actually, I’m considering modifying it to better fit the grammar checker, which retrieves this data from Hunspell.

So, can a very specific description of a compression algorithm for language data be copyrighted? I don’t know, but I think this is a creative matter.


HTH.

Regards,
Olivier
