Data versioning is very difficult because many data sources do not keep "old" versions online, only the current one.

With BioMAJ we propose to keep old versions (or a set number of old versions), but this is only local: it cannot help reproduce an experiment with exactly the same data if the remote source has since changed it.
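As a rough illustration of keeping only a fixed number of old versions locally, here is a minimal Python sketch. It is not BioMAJ's actual implementation: the function name `versions_to_prune` is hypothetical, and it assumes releases are labelled with ISO dates so that sorting the labels also sorts them by recency.

```python
from typing import List


def versions_to_prune(versions: List[str], keep: int) -> List[str]:
    """Return the release labels that fall outside the `keep` most recent ones.

    Assumption: labels sort chronologically (e.g. ISO dates "2011-02-14").
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Newest first; everything after the first `keep` entries can be removed.
    newest_first = sorted(versions, reverse=True)
    return newest_first[keep:]
```

For example, with the releases "2011-01-10", "2010-11-02" and "2011-02-14" and `keep=2`, only "2010-11-02" would be proposed for deletion.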

Granularity is indeed an issue. In our use of BioMAJ, we see many different requests from our users (biologists). Some just want a few chromosomes; others expect a full database (GEO, UniProt, etc.). Furthermore, you need to be sure you have the infrastructure to hold the data (a full GenBank download is quite big).

While you will often check this when you download data manually, the ease of use of a package could lead the user to skip this check. Otherwise, the user should be warned about disk requirements at install time.
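A check of this kind could be sketched as follows in Python. This is only an illustration: the function name `has_room` and the `safety_factor` margin are hypothetical, and the required size is assumed to come from somewhere such as package metadata. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


def has_room(target_dir: str, required_bytes: int, safety_factor: float = 1.2) -> bool:
    """Return True if the filesystem holding target_dir can take the download.

    required_bytes is assumed to be declared by the data package;
    safety_factor adds headroom for unpacking and indexing.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= required_bytes * safety_factor
```

An install script could call `has_room` before starting the download and print a warning, or abort, when it returns False.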


On 2/16/11 6:16 PM, Scott Christley wrote:
I don't disagree, in principle.  There are many nice aspects to the debian packaging as 
you indicate.  We don't want to replicate the 100s of terabytes of data into the debian 
repository, so any "package" would not have the real data but would download 
the data from its source during the package install.  Maybe through pre/post install 
scripts?  I'm not overly familiar with those capabilities but it seems plausible to me.

However, it does leave open an interesting question.  Exactly what granularity of data belongs in a 
"package"?  A genome sounds good, but there are already thousands of genomes.  There are 
thousands of microarray experiments.  And there are millions of sequence entries in GenBank.  It is 
plausible that the user would want access to individual sequences.  So the idea of managing 
thousands of "packages" starts to sound pretty cumbersome.

Versioning of data is definitely an important issue that is somewhat overlooked.  
This matters especially when scientists want to reproduce results from another 
researcher or from a paper: if you try to redo an experiment from many years ago, 
newer data could produce different results.  Galaxy[1] is one effort to get scientists to catalog reproducible workflows, 
and while it has some support for acquiring data, its main focus is on the analysis 
process.  I think the issue of "workflow governance" is still an open question.



On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:

well -- this issue is tangentially related to the software: why should
we care about having Debian packages while there are CRAN, easy_install,
etc -- all those great tools to deploy software -- domain specific and
created by specialists.  Although such a comparison is a stretch, I think
it has its own merits.  Encapsulating data (at least core sets) into
Debian packages integrates it nicely with the world of software within
Debian, with clear and uniform means of specifying dependencies on data
and of installing it, a known place to look for legal information, the
same canonical location for related software and data, etc.  Versioned
dependencies become especially relevant when constructing regression
tests for software that depends on the corresponding data packages, for
example.

I am not suggesting to replace all those data provider systems created
by professionals ;)  I am talking about complementing them
whenever feasible/sensible for the Debian needs/purposes.

On Tue, 15 Feb 2011, Scott Christley wrote:

I think putting the data itself into the Debian repository is problematic.  
Regardless of any licensing issue, the sheer amount of data is too great.  
Better to leave it to the professionals who are getting paid to manage the data 
(NCBI, KEGG, etc.) and download directly from those sites.  Pretty much all of them 
have ftp/http access to acquire data.
I like the getData effort.  Have a set of "data descriptors" with information 
about how/where to get data, and perform the download when requested.  This is very 
much the architecture I was thinking about.  I see a number of ways the project could be 
expanded.  I would like to hear thoughts from Steffen and Charles about getData before I 
jump in with a bunch of additions.
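
To make the "data descriptor" idea concrete, here is a minimal Python sketch. It does not reflect getData's actual format or code: the `DataDescriptor` fields and the `fetch` function are hypothetical names chosen for illustration, and the optional SHA-256 field is an added assumption so that a changed upstream file is detected instead of being silently used.

```python
import hashlib
import shutil
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class DataDescriptor:
    """Describes where a dataset lives; the data itself is not packaged."""
    name: str                      # local file name (hypothetical field)
    url: str                       # ftp/http location at the data provider
    sha256: Optional[str] = None   # expected checksum, if the provider is pinned


def fetch(desc: DataDescriptor, dest_dir: Path) -> Path:
    """Download the described data on request and verify it if a checksum is given."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / desc.name
    with urllib.request.urlopen(desc.url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    if desc.sha256 is not None:
        digest = hashlib.sha256()
        with open(target, "rb") as handle:
            # Hash in chunks so large files need not fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != desc.sha256:
            target.unlink()
            raise ValueError(f"checksum mismatch for {desc.name}")
    return target
```

A descriptor without a checksum simply fetches whatever the source currently serves, which is exactly the case where results may not be reproducible later.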
The BioMAJ project looks interesting as well.  One possibility is to use it as the 
underlying data retrieval layer, but it may also be "too complex" for basic 
retrieval functions.
Keep in touch
Yaroslav Halchenko

gpg key id: 4096R/326D8438
Key fingerprint = 5FB4 6F83 D3B9 5204 6335  D26D 78DC 68DB 326D 8438
