Hello all,
On 08/11/2011 01:58, Rob Weir wrote:
> Spell checking dictionaries are just compilations of facts that are
> constrained by the preexisting external facts of the language. The
> compiler of the dictionary does not create these facts. He merely
> encodes them. The particular dictionary might be copyrightable as a
> specific selection, coordination and arrangement of these facts, but
> fair use would allow me to extract the same facts from the
> dictionaries, via reverse engineering, and make my own selection,
> coordination and arrangement of these same facts and distribute them
> as my own dictionary. In other words, you might be able to protect
> the compilation of facts, but you cannot protect the underlying facts,
> or prevent people from copying your encoding of these facts and
> distributing a different arrangement of them. Copyright protection on
> a compilation of facts is extremely thin. It is that simple.
I am no expert on legal matters, and I think you might get different
legal answers in different countries.
So I’ll try to stay on technical ground.
Let’s assume that someone wants to create a Hunspell dictionary from
scratch. He finds a huge lexicon of well-organized information about
his language, a proper list of words with morphological data, tags, etc.
Let’s assume this is just a compilation of facts.
(Actually, even calling this lexicon a mere compilation of facts is
arguable, because there can also be a lot of specific classification,
personal tags, interpretation data, etc. Otherwise, we wouldn’t have had
so many arguments when we tagged the French dictionary. But let’s assume
it is.)
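To make this concrete, such a lexicon is typically a flat list of
entries with grammatical annotations, something like this (the entries
and tags below are only an illustration I am making up, not the actual
resource):

    chat      common noun, masculine
    chatte    common noun, feminine
    grand     adjective
    chanter   verb, 1st group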
Would this list _tell_ him how to create an affixation file? No.
Would this list _help_ him create an affixation file? No.
Is there just one way to create an affixation file from this list? No.
Actually, even if I had had such a lexicon of all the facts about the
French language when I began the work on the affixation file, it would
have required just as much time, as much reflection, and as many
personal choices.
Creating an affixation file is on a higher level than just collecting
data. It’s not a way of classifying, tagging or selecting data.
So, what is an affixation file? It is the description of a compression
algorithm, a description of a human-understandable logic for factorizing
the data of a specific language.
The lexicon could have been compressed with zip, rar, 7z or whatever
algorithm. In the same way, there are many ways to factorize a lexicon
with a human-understandable logic.
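To illustrate with a minimal, made-up sketch in the Hunspell format
(the flag name and the words are only illustrative, not taken from the
real French files): the .aff file declares a suffix flag with a few
rules,

    SFX F Y 3
    SFX F 0 e  .
    SFX F 0 s  .
    SFX F 0 es .

and the .dic file only stores the stems carrying that flag:

    2
    grand/F
    petit/F

Hunspell then accepts grand, grande, grands, grandes, petit, petite,
petits, petites: two entries and three rules stand for eight forms.
Which flags you define, and which stems you attach them to, is entirely
up to the author of the affixation file.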
When I created the French affixation file, one already existed, but I
was really not satisfied with it, so I rewrote it.
In the previous French dictionary, there were approximately 600 rules
in the affixation file and 92,000 entries in the word list.
After one year of work on the new affixation file, there were
approximately 12,000 rules and 60,000 entries, but this new dictionary
generates more inflections than the previous one, and also far fewer
mistakes (because affixation files can also have a lot of side effects
and can generate a lot of wrong inflections).
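To give a made-up example of such a side effect: a plural rule whose
condition is too loose, say

    SFX S Y 1
    SFX S 0 s .

applied to an entry like bijou/S would happily generate the wrong form
bijous, whereas the correct plural is bijoux. Keeping rules general
enough to be useful but tight enough to avoid such wrong inflections is
exactly the kind of choice the author of the file has to make.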
Even now, the compression method could be quite different from what it
is, but the data set would be the same. And, actually, I’m considering
modifying it so that it fits better with the grammar checker, which
retrieves this data from Hunspell.
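For instance, the toy example above could just as well be factorized
from the feminine form instead of the masculine one (again, purely
illustrative):

    SFX M Y 3
    SFX M e 0 e
    SFX M 0 s e
    SFX M e s e

with grande/M as the single .dic entry. The four forms grand, grande,
grands, grandes come out either way; only the description of how to
produce them changes.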
So, can a very specific description of a compression algorithm for
language data be copyrighted? I don’t know, but I think this is a
creative matter.
HTH.
Regards,
Olivier