I don't disagree, in principle.  There are many nice aspects to the Debian 
packaging, as you indicate.  We don't want to replicate the hundreds of terabytes 
of data into the Debian repository, so any "package" would not contain the real 
data but would download the data from its source during the package install.  
Maybe through pre/post-install scripts?  I'm not overly familiar with those 
capabilities, but it seems plausible to me.
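
Just to sketch the idea (totally hypothetical -- the URL, checksum, and 
destination path below are made-up placeholders, and a real Debian maintainer 
script would normally be shell rather than Python), a post-install hook could 
call something along these lines to fetch and verify a dataset at install time:

#!/usr/bin/env python3
# Hypothetical sketch: fetch a dataset from its upstream source at package
# install time and verify it against a known checksum.  The URL, checksum,
# and destination below are placeholders, not real values.
import hashlib
import urllib.request
from pathlib import Path

DATA_URL = "http://ftp.example.org/genomes/sample.fa.gz"  # placeholder source
DEST = Path("/var/lib/example-data/sample.fa.gz")         # placeholder location
EXPECTED_SHA256 = "0" * 64                                # placeholder checksum

def fetch_and_verify(url, dest, sha256):
    """Download url to dest and fail loudly if the checksum does not match."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    if digest != sha256:
        dest.unlink()
        raise RuntimeError("checksum mismatch for %s" % url)

if __name__ == "__main__":
    fetch_and_verify(DATA_URL, DEST, EXPECTED_SHA256)

That way the package itself would only ship the descriptor (source URL plus 
checksum), not the data.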

However, it does leave open an interesting question.  Exactly what granularity 
of data belongs in a "package"?  A genome sounds good, but there are already 
thousands of genomes.  There are thousands of microarray experiments.  And 
there are millions of sequence entries in GenBank.  It is plausible that the 
user would want access to individual sequences.  So the idea of managing 
thousands of "packages" starts to sound pretty cumbersome.

Versioning of data is definitely an important issue that is somewhat 
overlooked, especially when scientists want to reproduce results from another 
researcher or from a paper: if you try to redo an experiment from many years 
ago, newer data could produce different results.  Galaxy[1] is one effort to get 
scientists to catalog reproducible workflows, and while it has some support for 
acquiring data, its main focus is on the analysis process.  I think the issue 
of "workflow governance" is still an open question.

cheers
Scott

[1] http://galaxy.psu.edu/

On Feb 15, 2011, at 6:18 PM, Yaroslav Halchenko wrote:

> well -- this issue is tangentially related to the software: why should
> we care about having Debian packages while there are CRAN, easy_install,
> etc -- all those great tools to deploy software -- domain specific and
> created by specialists.  Although such a comparison is a stretch, I think
> it has its own merits.  Encapsulating (at least core sets of) data into
> Debian packages makes them nicely integrated within the world of
> software in Debian, with clear and uniform conventions for specifying
> dependencies on data, for installation, for where to look for legal
> information, for the canonical location of related software and data,
> etc.  Versioned dependencies become an especially relevant aspect in the
> construction of regression tests for software that depends on
> corresponding data packages, e.g.
> http://neuro.debian.net/pkgs/fsl-feeds.html.
> 
> I am not suggesting to replace all those data provider systems created
> by professionals ;)  I am talking about complementing them
> whenever feasible/sensible for Debian's needs/purposes.
> 
> On Tue, 15 Feb 2011, Scott Christley wrote:
> 
> 
>> I think putting the data itself into the Debian repository is problematic.  
>> Regardless of any licensing issues, the sheer amount of data is too great.  
>> Better to let the professionals who are paid to manage the data 
>> (NCBI, KEGG, etc.) do so, and to download directly from those sites.  Pretty 
>> much all of them have ftp/http access for acquiring data.
> 
>> I like the getData effort.  It has a set of "data descriptors" with 
>> information about how/where to get data, and performs the download when 
>> requested.  This is very much the architecture I was thinking about.  I see a 
>> number of ways the project could be expanded.  I would like to hear thoughts 
>> from Steffen and Charles about getData before I jump in with a bunch of 
>> additions.
> 
>> The biomaj project looks interesting as well.  One possibility is to use it 
>> as the underlying data retrieval layer, but it may also be "too complex" for 
>> basic retrieval functions.
> 
>> Scott
> -- 
> =------------------------------------------------------------------=
> Keep in touch                                     www.onerussian.com
> Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

