Re: [EMBOSS] Files included in EMBOSS but licensed ...

Chris Fields Sat, 30 Jul 2011 12:43:49 -0700

On Jul 30, 2011, at 6:36 AM, Adam Sjøgren wrote:

> On Sat, 30 Jul 2011 09:58:07 +0100, Peter wrote:
> 
>> A specific example might help. About 5 years ago a release of the
>> UniProt database (as plain text files) broke the Wisconsin (GCG)
>> sequence analysis package.
> 
> [...]
> 
> This is the opposite problem of what I tried to sketch.
> 
> Your example has closed source software that can't be fixed, leading to
> either preprocessing or changing the database rather than fixing the
> real problem.
> 
> If the software had been free, you could just have fixed the software.
> 
> Switch around "software" and "database", and you have the example I was
> trying to paint.


Yes, if the source were available, fixing the parser would have been the best 
option.  But I think you are missing the fundamental point that Peter made 
(that you left out): the wording of the license allowed them to reformat the 
file w/o changing the actual content.  I'm not sure, but I believe many GenPept 
documents are Uniprot-derived and follow the same concept. 

Data records and databases are not software, unless you are using some very 
fuzzy definition of such.

>> I expect there are many problems that arise if data ... and
>> documentation ... are considered to be software.
> 
> Sure. The whole GFDL debate took quite a while, I think.
> 
> But that doesn't change that one of the solutions outlined by Charles
> Plessy is necessary for Debian to distribute EMBOSS (and any other piece
> of free/redistributable software).

You'll also note Charles's distaste for the options mentioned.  He was also 
searching for alternatives.

>>> (I personally think it would make sense to change to a Creative Commons
>>> license that allows derivative works - Uniprot and others are going to
>>> be the canonical source for the data anyway, so nothing will be lost by
>>> them by doing that, as far as I can see.)
> 
>> Unlikely. The no-derivatives version is specifically there to prevent
>> derivatives - for example Debian distributing a modified UniProt
>> without permission.
> 
> What I was trying to say is that I don't think that that clause gives
> any value to the owners of Uniprot and other databases.
> 
> Why would Uniprot want to prevent derivative works? They'll always be
> the canonical source for the correct information.

The links provided in my other response indicate some of the mindset behind 
this. I think the main point is that the work has to be attributed, and that 
any changes to such data need permission of Uniprot, likely so any content 
changes can be curated and (possibly) propagated to future releases. This also 
ensures that a set of files from a third party containing the Uniprot name will 
not be modified (i.e. all content can be trusted as coming from Uniprot w/o 
modification).  

I have seen instances where loosely controlled data (such as annotation from a 
newly sequenced genome) becomes balkanized to the point that no one can clearly 
state who is the trusted source (even when the list of sources includes large 
databases such as NCBI/EBI).  So I understand the reasoning for the license, 
but I also see Science Commons is recommending something less strict.

> You are free to distribute a modified version of the man-page for ls(1)
> - but if you introduce errors in it or make it worse, nobody will choose
> your derived version.

That's a straw man argument; man page documentation for an app is not the same 
as a database record based on scientific data.  Would you make the same argument 
(allow free content modification) for a scientific publication?  I would, but 
only for corrections or for new data that support/contradict the original data, 
and even then it must go through some sort of mediation (an editor for 
instance), not unlike what a database curator does.

>> The ontologies are similar, but do allow for the use case of importing
>> terms from one ontology into another if the ontology name is changed
>> (and preferably if cross-references to the original are provided).
> 
>> Again, the need is to protect the integrity of the original ontology
>> content so references to a GO term or a UniProt entry are clearly
>> defined.
> 
> I think the problem that is being protected against is non-existing.
> 
> People don't want to break stuff that works, they want to be able to fix
> stuff that doesn't.

Simply opening the licensing up for any content modification doesn't solve the 
problem in the case of scientific databases; it potentially exacerbates it.  
Hence the variations in the licensing in the previous links I sent.  By the 
way, if you think the classic 'vi vs emacs' arguments can get out of control, 
see what happens when you have competing groups trying to make changes to a 
sequence record w/o curation.

I do agree that it would be nice for the barrier to database modification to be 
lowered. Many previous attempts have been made at doing this, such as including 
third-party annotation, but with the major databases these attempts all seem to 
fall by the wayside, and the databases fall back to simple curation. 

Maybe it's time to come up with a git/hg for biological data, where one could 
fork records and make changes for submission; at least there one could have a 
trusted source and easier paths to data modification.  Just a thought.

>> This is essential for many of the public bioinformatics databases.
> 
> Why? Only a hypothetical derivative would be changed, not the original.
> 
> If someone distributed a derivative that was broken, I think people
> would quickly abandon it.

How could one tell the difference if both versions are implied to come from 
Uniprot (even if one comes from a third/fourth/fifth party)?  There is no 
guarantee beyond going back and comparing the records to the original Uniprot 
data.  

> Again, just my point of view - not representing or speaking for anyone :-)
> 
> 
>  Best regards,
> 
>    Adam


chris

Christopher Fields
Senior Research Scientist
National Center for Supercomputing Applications
Institute for Genomic Biology
University of Illinois Urbana-Champaign
1206 W. Gregory Dr. , MC-195
Urbana, IL 61801



_______________________________________________
EMBOSS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/emboss
