Hello all,

On 08/11/2011 01:58, Rob Weir wrote:

Spell checking dictionaries are just compilations of facts that are
constrained by the preexisting external facts of the language.  The
compiler of the dictionary does not create these facts.  He  merely
encodes them.  The particular dictionary might be copyrightable as a
specific selection, coordination and arrangement of these facts, but
fair use would allow me to extract the  same facts from the
dictionaries, via reverse engineering, and make my own selection,
coordination and arrangement of these same facts and distribute them
as my own dictionary.  In other words, you might be able to protect
the compilation of facts, but you cannot protect the underlying facts,
or prevent people from copying your encoding of these facts and
distributing a different arrangement of them.  Copyright protection on
a compilation of facts is extremely thin.  It is that simple.

I am no expert on legal matters, and I think you might get different legal answers in different countries.

So I’ll try to stay on technical ground.

Let’s assume that someone wants to create a Hunspell dictionary from scratch. He finds a huge lexicon of well-organized information about his language: a proper list of words with morphological data, tags, etc. Let’s assume this is just a compilation of facts.

(Actually, even calling this lexicon a mere compilation of facts is arguable, because it can also contain a lot of specific classification, personal tags, interpretation data, etc. Otherwise, we wouldn’t have had so many arguments when we tagged the French dictionary. But let’s assume so.)

Would this list _tell_ him how to create an affixation file? No.
Would this list _help_ him create an affixation file? No.
Is there just one way to create an affixation file from this list? No.

Actually, even if I had had such a lexicon of all the facts about the French language when I began work on the affixation file, it would have required just as much time, just as much thought, and just as many personal choices.

Creating an affixation file is at a higher level than just collecting data. It’s not a way of classifying, tagging, or selecting data.

So, what is an affixation file? It is a description of a compression algorithm: a human-understandable logic for factorizing the data of a specific language.

The lexicon could have been compressed with zip, rar, 7z, or any other algorithm. In the same way, there are many ways to factorize a lexicon with a human-understandable logic.
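
To make this concrete, here is a minimal sketch in Python. It is a toy model, not Hunspell's actual file format: the `expand` function, the flag letters, and the tiny English word list are all made up for illustration. It shows two different factorizations (one broad rule class versus three narrow ones) that expand to exactly the same lexicon.

```python
def expand(entries, rules):
    """Expand (stem, flags) entries into the full set of word forms.

    `rules` maps a flag letter to a list of (strip, add) suffix rules.
    This is a simplified, hypothetical model of affix expansion.
    """
    words = set()
    for stem, flags in entries:
        words.add(stem)  # the bare stem is itself a valid word
        for flag in flags:
            for strip, add in rules[flag]:
                if strip and not stem.endswith(strip):
                    continue  # rule does not apply to this stem
                base = stem[:len(stem) - len(strip)] if strip else stem
                words.add(base + add)
    return words


# The "facts": the plain list of valid word forms.
lexicon = {"walk", "walks", "walked", "walking",
           "talk", "talks", "talked", "talking"}

# Factorization A: one broad rule class carrying three suffixes.
rules_a = {"V": [("", "s"), ("", "ed"), ("", "ing")]}
entries_a = [("walk", "V"), ("talk", "V")]

# Factorization B: three narrow rule classes, combined per entry.
rules_b = {"S": [("", "s")], "D": [("", "ed")], "G": [("", "ing")]}
entries_b = [("walk", "SDG"), ("talk", "SDG")]

# Both descriptions decompress to the same data set.
assert expand(entries_a, rules_a) == lexicon
assert expand(entries_b, rules_b) == lexicon
```

The data set is identical in both cases; only the description of how to regenerate it differs, and that description is where the author's choices go.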

When I created the French affixation file, one already existed, but I was really not satisfied with it, so I rewrote it. With the previous French dictionary, there were approximately 600 rules in the affixation file and 92,000 entries in the word list. After one year of work on the new affixation file, there were approximately 12,000 rules and 60,000 entries, yet the new dictionary generates more inflections than the previous one, with far fewer mistakes (affixation files can have a lot of side effects and generate a lot of wrong inflections).
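
The side effects mentioned above can be sketched as follows. Again, this is a toy model in Python and not Hunspell's real syntax: `apply_rule`, `generate`, and the rule tuples are hypothetical. A single loose plural rule produces a wrong form for "cheval", while two rules with conditions on the stem ending produce the right one, at the cost of a larger rule set.

```python
import re


def apply_rule(stem, strip, add, condition):
    """Return the derived form, or None if the stem ending fails the condition."""
    if not re.search(condition + "$", stem):
        return None
    base = stem[:-len(strip)] if strip else stem
    return base + add


def generate(stems, rules):
    """Apply every (strip, add, condition) rule to every stem."""
    forms = set()
    for stem in stems:
        for strip, add, condition in rules:
            form = apply_rule(stem, strip, add, condition)
            if form is not None:
                forms.add(form)
    return forms


stems = ["chat", "cheval"]

# One loose rule: add "s" to anything. Short, but it overgenerates.
loose_rules = [("", "s", ".")]

# Two conditioned rules: more to write, fewer wrong inflections.
strict_rules = [
    ("", "s", "[^l]"),     # plural in -s unless the stem ends in "l"
    ("al", "aux", "al"),   # -al becomes -aux
]

assert generate(stems, loose_rules) == {"chats", "chevals"}    # "chevals" is wrong
assert generate(stems, strict_rules) == {"chats", "chevaux"}
```

The conditions here are deliberately crude (real French has exceptions such as "bals"), which is exactly why a careful rule set grows large.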

Even now, the compression method could be very different from what it is, while the data set would stay the same. In fact, I’m considering modifying it to better fit the grammar checker, which retrieves its data from Hunspell.

So, can a very specific description of a compression algorithm for language data be copyrighted? I don’t know, but I think this is a creative matter.

HTH.

Regards,
Olivier
