Hello everyone,
I don’t like mailing lists, so I have subscribed here just to explain
a few things about dictionaries. Then I’ll vanish.
Rob Weir wrote:
> Just make sure that you explain what a spell checking dictionary is.
> Otherwise any legal types will be confused. This is not a dictionary
> like Webster's, with words and definitions, where the definitions are
> creative content. A spell checking dictionary is more of a word list.
> I'm not sure what the creative expression is in a list of all common
> words in a language and how that could be copyrighted. Of course, I
> am not a lawyer.
A few dictionaries are just word lists, but most of them are lists of
words tagged with flags. These flags are described in an affixation
file, which specifies the rules used to generate inflections. This
affixation file can be quite simple or very complex, and writing it can
be a difficult matter.
It looks easy at first, but when you dig deeper into this matter, there
are often a lot of issues to handle. Creating a proper affixation file
can be really hard work. And even when the difficulty is not high, it
can be a very long job.
So, no, Hunspell dictionaries are not just word lists.
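To make the difference with a plain word list concrete, here is a
minimal Python sketch of how a flagged entry can be expanded through
suffix rules. It is only an illustration under simplifying assumptions:
the rule table, the flag "S" and the function expand() are invented for
this example, flags are single characters, only suffixes are handled,
and conditions are treated as regular expressions matched at the end of
the word. Real affixation files offer far more than this.

```python
import re

# Hypothetical, heavily simplified suffix rules:
# flag -> list of (characters to strip, characters to add, condition).
RULES = {
    "S": [
        ("y", "ies", "[^aeiou]y"),  # try  -> tries
        ("", "s", "[aeiou]y"),      # play -> plays
        ("", "s", "[^sxy]"),        # word -> words
    ],
}

def expand(entry, rules):
    """Return the sorted forms generated by an entry such as 'try/S'."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    for flag in flags:
        for strip, add, condition in rules.get(flag, []):
            # The condition must match the end of the stem.
            if re.search(condition + "$", stem) and stem.endswith(strip):
                base = stem[: len(stem) - len(strip)]
                forms.add(base + add)
    return sorted(forms)

if __name__ == "__main__":
    for entry in ("try/S", "play/S", "word/S", "list"):
        print(entry, "->", expand(entry, RULES))
```

Even in this toy version, each rule needs a condition that does not
overlap with the others, which is where much of the real work goes.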
For example, it took me one year and countless hours of work to rewrite
the affixation file of the French dictionaries from scratch. Even after
that, there were still a lot of bugs (not spelling mistakes). For one
year, I had to patch the affixation file regularly. Even after a few
years, there is still sometimes something to fix. The French
dictionaries contain approximately 13000 rules.
Here is an example of one of the most complex flags:
http://www.dicollecte.org/affixes.php?prj=fr&flag=c2
(AFAIK, there is only one dictionary which has a more complex affixation
file, the Hungarian one.)
I also tagged the affixation file in order to generate 4 different
dictionaries with a script, to give users the means to write
according to their preferences regarding the optional and controversial
French spelling reform of 1990.
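As a rough idea of what generating several dictionaries from one tagged
source can look like, here is a small Python sketch. Everything in it is
an assumption made for illustration: the "#@" tag marker, the tag names,
the profile names and the dummy rule lines are invented, and this is not
the script actually used for the French dictionaries. It shows three
made-up profiles only.

```python
def build_variant(tagged_lines, accepted_tags):
    """Keep untagged lines and lines whose tag is accepted; drop the tag."""
    kept = []
    for line in tagged_lines:
        body, marker, tag = line.partition("#@")
        # No marker means the line belongs to every variant.
        if not marker or tag.strip() in accepted_tags:
            kept.append(body.rstrip())
    return kept

# Dummy rule lines with hypothetical variant tags.
SOURCE = [
    "SFX a 0 s .",
    "SFX b 0 x . #@ classic",
    "SFX b 0 y . #@ reform",
]

# Hypothetical profiles: profile name -> set of accepted tags.
PROFILES = {
    "classic": {"classic"},
    "reform": {"reform"},
    "all": {"classic", "reform"},
}

if __name__ == "__main__":
    for name, tags in PROFILES.items():
        print(name, build_variant(SOURCE, tags))
```

The point of such a layout is that every fix is made once, in the
tagged source, and all variants are regenerated from it.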
Besides, 99 % of entries have been manually grammatically tagged.
Several contributors did a tremendous job by adding lexical tags,
adding many words, moving entries into different subdictionaries
according to our policy, handling special cases, reporting mistakes and
issues.
Spelling matters are much more complex than you think, especially if
you want to use your dictionary for grammar checking.
We often have to handle old, new or variant spellings of a single
word, and there are decisions to make about what to do with special
cases, which are actually very numerous. Managing dictionaries is not a
trivial task.
Here is the "bugtracker" where we work on the French dictionaries.
http://www.dicollecte.org/propositions.php?prj=fr&tab=E [fr]
(This bugtracker also allows us to commit changes to the dictionary in
the database.)
The changelog:
http://www.dicollecte.org/log.php?prj=fr
This dictionary is used by both French grammar checkers.
What you said about copyright could be right for a list generated by
a script from a corpus, but that’s not true for dictionaries which are
designed by humans, with their knowledge, their work and their choices.
> But we'll never resolve this on legal grounds. At Apache we would not
> bundle a dictionary under a legal theory if the compiler of the
> dictionary did not want us to. I think we should respect the
> dictionary compiler's wishes and intent,
> _even if legally we're not obligated to_.
Wow... That’s really not encouraging for people who may consider
changing the license of their work... Does IBM think the same way?
A few years ago, when I began to contribute to FLOSS, I thought the
less restrictive licenses were the better ones, only because I didn’t
care and I was ignorant about licensing and political matters.
As time goes by, I think more and more the opposite. And when I read
you, I’m beginning to think I was still too soft on that topic.
> 3) We could contact the compilers of the dictionary and ask if they
> would make them available under a different license. Generally
> people make things available under an OSS license because they want to
> see other projects use them. If we tell them that a leading
> application like OpenOffice can no longer use their dictionary, this
> might persuade them to change their license.
Here is the situation for the French dictionaries:
1. The Hunspell spelling dictionaries
Licenses: MPL/LGPL/GPL
As I am the sole author of the affixation file, and as I grammatically
tagged about 90 % of all entries myself (without copying another lexicon
with a script), I can say for sure that I do not intend to change the
licenses to the Apache one.
When I built Dicollecte, my goal was to encourage people to
contribute for the benefit of all and give back the improvements they
made. Switching to the Apache license would contradict everything I did.
By the way, these dictionaries _require_ Hunspell. They won’t work
properly with Myspell. I have seen a lot of people assume that Hunspell
dictionaries will work with Myspell. That assumption is wrong. Hunspell
can use Myspell dictionaries, but Hunspell also offers a lot of new
features which make it possible to improve the structure of
dictionaries.
Myspell does not recognize double suffixation or double
prefixation, cannot handle duplicate lemmas, does not handle
morphological tags, has a limited number of flags, does not recognize
Hunspell compound commands, etc. (I am not even sure that Myspell can
use UTF-8 files.)
But, good for you, AFAIK, many dictionaries still have a Myspell
structure. But not the French ones and some others.
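To show what double suffixation means in practice, here is a small
Python sketch. It is not the implementation of Hunspell or Myspell: the
rule table, the flags "A" and "S" and the function expand() are
hypothetical, and the only idea illustrated is that a form produced by
one rule can carry flags of its own, which a second level of expansion
then applies.

```python
# Hypothetical rules: flag -> list of (suffix to add, flags carried by the result).
RULES = {
    "A": [("able", "S")],  # the derived form may itself take flag "S"
    "S": [("s", "")],
}

def expand(stem, flags, rules, max_depth):
    """Generate forms, following flags on derived forms up to max_depth levels."""
    forms = {stem}
    if max_depth == 0:
        return forms
    for flag in flags:
        for suffix, next_flags in rules.get(flag, []):
            forms |= expand(stem + suffix, next_flags, rules, max_depth - 1)
    return forms

if __name__ == "__main__":
    # One level only: the second suffix is never applied.
    print(sorted(expand("drink", "A", RULES, 1)))
    # Two levels: the derived form "drinkable" also receives its plural.
    print(sorted(expand("drink", "A", RULES, 2)))
```

A dictionary written to rely on the second level loses forms when it is
read by an engine that stops at the first one.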
2. The thesaurus
The initial and main author released it under the LGPL license.
He has since died. AFAIK, there is no way to change the license before
his work enters the public domain, and there have also been
several improvements to the initial work.
At the moment, I am working on transforming it into a list of
"synsets" which could be used to generate a better thesaurus. A list of
synsets would be a far better basis to work on. I don’t know if I will
succeed. This is a difficult matter and it requires a lot of work.
3. Hyphenation rules
License: LGPL
This is a dictionary converted from the hyphenation rules for TeX,
modified somewhat to handle several issues.
I did nothing on it. I’m just packaging it in the extensions for
OOo/LibO. You'll have to contact the people who created it.
> 4) We could convert another word list or dictionary, one that has a
> better license, into Hunspell format.
Hmmm...
You may generate affixation rules for Myspell with a script… but
then, these dictionaries will probably be such a mess that you’ll be
very lucky if you find someone with enough abnegation to improve them.
The main issues with such dictionaries are:
- if you just create a list of words, you may only improve it with
a text parser or other lexicons, but it will be hard and annoying to
improve it manually, as the list will be very, very long, and it will
waste memory. And each time you regenerate it with your script,
you’ll have to redo manually the fixes you made before.
- if you create an affixation file with a script, your dictionary will
be a mess, with no easy way to improve it, as the dictionary structure
will not be intuitive for a human being. And again, you cannot really
mix improvements made by script and improvements made by hand.
The best way is to find somewhere a good, already tagged lexicon under
a non-restrictive license. Even then, you’ll have to write a proper
affixation file manually… and then, you may discover it is not the easy
task you thought it was, unless your language is somehow very logical,
with neither exceptions nor weird stuff…
Regards,
Olivier R.