*tl;dr - If you publish data, attach the CC0 license to it, but that’s
basically just advertising - don’t think it means anything.*
*If you use data, you do not have to care much about the data license.*
*If you republish data, it’s a bit more complicated, but not as horrible
as you might think.*
Imagine a student reading a CC-BY-SA published textbook on compilers.
Next thing, based on that knowledge, he writes a parser and publishes
the binary on the Web. Does he have to acknowledge the textbook? Does he
have to publish his code under the same license?
Imagine a designer creating an image with GIMP, a fantastic open source
image processing tool, published under the GPL. Or a developer writing
his code in Eclipse. Or a website being served from a Linux box. What
legal implications does it have for the license of the image? For the
source code? For the served page?
Imagine a search engine that changes its background color depending on
the type of thing you are searching for. You enter a city - it turns
gray. You enter a person - red for females, blue for males, and purple
for others. You enter a company - yellow. And so on. Let us assume that
the search engine does that by figuring out the thing you are searching
for and then asking DBpedia for its type. Since DBpedia is licensed
under CC-BY-SA, does this mean we have to put a link on the search
result acknowledging DBpedia? Does this mean we have to publish our
search index under CC-BY-SA as well?
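
To make the scenario concrete, here is a minimal sketch of the color logic. Everything named in it is made up for illustration: the `FAKE_TYPE_INDEX` table stands in for the question a real system would ask DBpedia, and the white default is an assumption, since the scenario does not say what happens for unknown things.

```python
# Hypothetical stand-in for a DBpedia lookup. A real system would ask
# DBpedia for the type of the entity instead of using a fixed table.
FAKE_TYPE_INDEX = {
    "Oxford": ("City", None),
    "Marie Curie": ("Person", "female"),
    "Alan Turing": ("Person", "male"),
    "Siemens": ("Company", None),
}

PERSON_COLORS = {"female": "red", "male": "blue"}
TYPE_COLORS = {"City": "gray", "Company": "yellow"}
DEFAULT_COLOR = "white"  # assumption: not specified in the scenario


def lookup_type(query):
    """Return (type, gender) for a query, or (None, None) if unknown."""
    return FAKE_TYPE_INDEX.get(query, (None, None))


def background_color(query):
    """Pick the background color from the type of the thing searched for."""
    entity_type, gender = lookup_type(query)
    if entity_type == "Person":
        # Purple covers every person who is neither female nor male.
        return PERSON_COLORS.get(gender, "purple")
    return TYPE_COLORS.get(entity_type, DEFAULT_COLOR)
```

Note that the only thing taken from the data source is a single fact per query (the type). No file, text, or other expression is copied into the result page.
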
Imagine the Red Cross publishing pages about the countries they work in,
and adding to each of them the population data from Freebase, the location
from OpenStreetMap, the local name of the country from GeoNames, and
the capital from DBpedia. What amount of legal disclaimer would need to
be displayed on the page? Maybe some of the data items derive from
another source? What about their licenses? What about this license
stacking effect?
There are some rather vague ideas floating around about how the whole
intellectual property law apparatus works for data. I have mulled over
this for a long time, and read more laws and court cases than I care to
admit. I want to try to make a few points in the following.
Let’s start with the basics. Which laws actually apply?
Copyright law protects the expression, not the idea - the form, not the
content. You can watch the newest Iron Man movie, and you are legally
allowed to annoy your friends with retellings of the movie as often as
you want. But you are not allowed to film it with your phone camera in
the theater and display it to your friends. If you learn something from
a textbook, you are free to write your own textbook, adding other
knowledge you have acquired, possibly from other textbooks and
publications. Only if you start copying the original texts too closely
will you get into legal trouble.
Almost all of the above-mentioned licenses - all Creative Commons
licenses currently available, as well as the GFDL or the GPL - are based
on copyright law. The GPL started, as Stallman admits, as a legal
hack of copyright law. This makes a lot of sense, since these licenses
were not meant to cover data, but expressions: texts, music, and the
like. This means these licenses cannot extend beyond that. They only
cover the expression. They cover the actual RDF/XML file, the string of
characters. Not the content. Not the graph.
(Note that the ODbL and the current draft of the upcoming fourth revision
of CC go beyond copyright and include the database right where applicable,
i.e. within the jurisdiction of the EU. This extension is irrelevant for
the US.)
This means that such licenses, like the GFDL, have no restricting effect
when they are attached to data that you merely want to use. Only if you
want to republish the data files more or less verbatim (in whole or
partially, standalone or as part of a bigger project) do you need to
think about the original license. Merely including the data (not the
files!) triggers no obligation stemming from copyright.
This also makes intuitive sense: if someone takes Wikipedia and counts
the distribution of words and letters in Wikipedia, the subsequent
publication of the results is not restricted by the license under which
Wikipedia was published. If someone takes the whole Web, and creates a
graph of all links on the Web, and starts to apply some algorithms on
this graph, the subsequent usage of the results of these algorithms is
not subject to any of the licenses of the original texts published on
the Web. Copyright simply does not extend this far. And that is good.
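
The word-and-letter example can be sketched in a few lines. This is only an illustration under simple assumptions: the tokenizer is a naive regular expression for ASCII words, and the `sample` string stands in for the real corpus.

```python
import re
from collections import Counter


def word_and_letter_counts(text):
    """Return (word frequencies, letter frequencies) for a text."""
    lowered = text.lower()
    # Naive tokenizer: runs of ASCII letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", lowered)
    letters = [ch for ch in lowered if ch.isalpha()]
    return Counter(words), Counter(letters)


# Stand-in for a real corpus such as a Wikipedia dump.
sample = "The graph is not the file. The file is not the graph."
word_counts, letter_counts = word_and_letter_counts(sample)
print(word_counts.most_common(3))
```

The result is a table of numbers. It contains none of the original sentences, which is exactly why the license of the source text does not carry over to it.
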
So much for copyright. Unfortunately, the European Union went a step
further. They recognized that copyright does not apply to databases.
They also recognized that the EU was not doing well in its competition
against the US with regard to publishing databases. So they decided to
level the field by introducing a completely new right, the database
right. This protects the effort that goes into creating databases -
basically their schema (which columns should I have) and the coverage
(which rows do I have in my database). Ten years later the EU evaluated
the effectiveness of the law, and came to some interesting conclusions:
first, technically the new database right made things more complicated;
second, most publishers obviously do not understand it, but are happy
with what they think it means (which usually contradicts what it
actually means); and third, it completely failed in its goal to advance
the database publishing sector. The report offers the option of dropping
the whole database right again, but so far nothing has happened.
This novel database right also took a few major blows from the European
Court of Justice, which clearly stated that the right does not cover
the effort of creating the data, merely the effort put into obtaining,
selecting, and cleaning the contents of a database. This means, for
example, that the publication of match dates and fixtures by FIFA cannot
be protected under the database right. On the other hand, if an external
website keeps statistics on all FIFA players, how much they cost, where
they currently are, etc., then its database as a whole could be protected.
But to make it clear: the database right does not apply to single data
items in the database. If I keep a database of all cities in the UK
and their populations, and someone asks for the population of Oxford
from my database, the database right does not prevent them from
republishing and using that data item as they like. Eurostat cannot sue
you if you tell someone the population of France.
To summarize on database rights: the EU, and only the EU, introduced the
so-called database right in 1996. It is independent of copyright, and
covers a database as a whole in certain circumstances. If you are in the
EU and want to use the data, the database right does not restrict you.
It only restricts you from republishing the database as a whole or in
substantial parts.
Besides the legal foundations of the data licenses, one also has to
consider that copyright law refers predominantly to the right to copy
the data, not to use it: if you want to count how often certain explicit
words are uttered in a movie like Pulp Fiction, you are free to do so.
If you want to count and compare the death count in certain books and
movies (like Rambo, War and Peace, and the Bible - the results might
surprise you), you are free to do so. You are free to publish the
results, and you are even more free to use them internally in your
organization.
Having said that, I still recommend adding the CC0 license to a dataset
when you publish it. I grumble every time I do it, but it still makes
sense. Not because I believe that it means much: as said, the data in it
is free anyway. But because a lot of other people believe that it means
a lot. They might believe that if they integrate a point of data from a
CC-BY-SA licensed dataset in their own dataset, they have to publish it
under CC-BY-SA as well. They might believe that mixing a CC-BY-SA
dataset with an ODbL dataset and displaying the results is legally
impossible. Maybe they don’t even believe it, but they are required to
ask their lawyers, and their lawyers will prefer to play it safe for
their clients (it is their job!) and advise them accordingly. And for
all of these people, the CC0 license is an item of assurance. So if you
want your dataset to be usable by them, just add a CC0 license to it.
And grumble about it.
There is a completely independent reason why it can make sense to cite
your data sources: trust and provenance. Even if a dataset is not
published under a CC-BY-like license, that is, a license that requires
attribution, it often makes sense to keep the provenance and attribution
intact - simply because the users of your data might ask for the source
themselves, and might want to check its credibility. But attribution to
increase your credibility is something entirely different from
attribution because you think you are legally obliged to give it for the
data you used.
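
Keeping provenance for trust rather than for legal reasons can be as simple as storing the source next to each value. The following is a minimal sketch; the `Fact` record and its field names are made up for illustration and do not follow any particular vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    prop: str
    value: str
    source: str  # kept so readers can check credibility, not as a license notice


def describe(facts):
    """Render each fact together with the source it was taken from."""
    return [f"{f.prop} of {f.subject}: {f.value} (source: {f.source})" for f in facts]


# Illustrative entries for a country page that mixes several sources.
country_page = [
    Fact("France", "capital", "Paris", "DBpedia"),
    Fact("France", "local name", "France", "GeoNames"),
]

for line in describe(country_page):
    print(line)
```

The source travels with each data item, so a reader who doubts a value knows where to check it.
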
If I were an organization or individual with sufficient financial
backup, I would even offer to pick up your legal battles if a data
publisher ever sues you for using their data (not for republishing it
verbatim, though). I hope that maybe an organization or individual will
step up at some point to do so, but I wouldn’t hold my breath for it.
Both the US Supreme Court and the European Court of Justice have
repeatedly decided in favour of the freedom of data, be it the results
of games, be it telephone numbers, be it horse racing fixtures.
So, as paradoxical as it sounds: Data is free. Free the data!
There is a battle over minds going on. One side fights for the
establishment and extension of intellectual property rights. In the last
decades, even years, they have achieved some considerable victories.
Copyright, as it was first introduced in the United States, lasted for
14 years, and had to be explicitly claimed. Today it holds not only for
the lifetime of the creator, but also an additional 70 years (to
incentivize the creator to produce more, because an author would be much
less motivated to write if they knew that half a century after their
death their highly beloved publisher wouldn’t make profit out of their
work anymore). Today, copyright applies automatically, without any
registration or statement. There is no need to put the little c in a
circle anywhere. It is there, automatically, everywhere.
The extension from works to content, from expression to ideas, is
another dimension, this time in scope instead of time, in the continuous
struggle to extend and expand intellectual property rights. It is not
just a battle over the laws, but also, and more importantly, over our
beliefs and minds, to make us more accepting of the notion that
ideas and knowledge belong to companies and individuals, and are not
part of our commons.
Every time data is published under a restrictive license, “they” have
managed to conquer another strategic piece of territory. Restrictive in
this case includes CC-BY, CC-BY-SA, CC-BY-NC, GFDL, ODbL, and (god
forbid!) CC-BY-SA-NC-ND, and many other such licenses.
Every time you wonder what license some data has that you want to use,
or whether you need to ask the data publisher if you can use it, “they”
have won another battle.
Every time you integrate two data sources and want to publish the
results, and start to wonder how to fulfill your legal obligation
towards the original dataset publishers, “they” laugh and welcome you as
a member of their fifth column.
Let them win, and some day you will be sued for mentioning a number.
Links:
I am not linking to the obvious texts, which are the actual laws. Read
them. They are not as impenetrable as you think. I mean, heck, if you
can make sense of an RDF/XML file, you shouldn’t be scared of some legal
text.
Evaluation of the European Commission on the effect of database rights
http://ec.europa.eu/internal_market/copyright/docs/databases/evaluation_report_en.pdf
US Supreme Court, Baker v. Selden - on the extent of copyright with
regard to the expression, not the content
http://www.justia.us/us/101/99/case.html
Sorry for the far too long reply. It is not meant as a critical reply to
Pascal and his colleagues’ text, but rather as something that has been
brewing in me for a while. Their text prompted me to write it down, and
in the framework of their text I would read it as a contribution to
point 5 of their way forward.
This text was written by me on a Saturday morning, as a completely
personal opinion. It does not represent the official point of view of
any current, former, or future employer, nor of any project I ever was,
am, or will be affiliated with or am thought to be affiliated with.