Really nice to see DataLad being mentioned!
Disclaimer: I am one of the DataLad founders/developers, so my opinions are 
obviously biased.

First of all, I want to say that the development of DataLad was largely inspired 
by two things: software distributions (and our experience with the Debian and 
NeuroDebian projects), which make software tools readily available with clean 
versioning and through a unified interface, and version control systems, which 
we have been using for all our daily work for almost two decades now (I started 
using Git in 2007; before that CVS, then SVN, and a little bzr).  With DataLad 
we first of all wanted to establish a data distribution, but because we use and 
love Git, while working on DataLad we realized that we had ended up with a 
full-blown data version control/management system rather than just a "data 
distribution".

Let me go through some bullet points, which relate to other posts in the 
thread, in the hope that they might be informative and help you make an 
informed decision on the choice of a version control system for data.  This is 
by no means a complete presentation of git-annex and/or DataLad; I would 
refer you to our (possibly incomplete) documentation and examples at 
http://datalad.org .

But whatever system you choose, my main message would be

*"PLEASE DO VERSION CONTROL YOUR DATA!"*

and do not leave data a "second-class citizen" among the digital objects of 
your research/work.


* "Distributed data" - one of the main reasons for choosing git-annex was the 
ability to provide access to data that is already publicly available.

When we started working on DataLad, there was no other general solution that 
made it possible to "hook" into existing data portals.  We simply could not 
duplicate all the data on our own server for re-distribution, the way we do 
with software in e.g. Debian.  At the moment, almost all data for the datasets 
provided from http://datasets.datalad.org comes from a wide variety of original 
locations (S3 buckets, http websites, custom data portals, ...) through the 
unified git-annex/DataLad interfaces.  The  datalad crawl  command can be used 
to quickly establish a git/git-annex repository from a standard S3 bucket or a 
simple website, or you can provide custom crawling pipelines 
(https://github.com/datalad/datalad-crawler).  The majority of the datasets on 
http://datasets.datalad.org are created and updated via the  datalad crawl  
command, and later published to that website over ssh using  datalad publish.  
So you could get yourself a similar "data portal" within minutes if you have 
something to share.
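As a rough sketch of that crawl-and-publish workflow (assuming the 
datalad-crawler extension is installed; the dataset name, template name, URL, 
and sibling name below are hypothetical placeholders, and exact options may 
differ between versions):

```shell
datalad create mydataset
cd mydataset
# configure a crawling pipeline, e.g. for a plain website serving data files
datalad crawl-init --save --template=simple_with_archives url=http://example.com/data/
datalad crawl                  # populate the dataset from that source
# later: publish the resulting git/git-annex repository to your own server
datalad publish --to=myserver
```

Re-running  datalad crawl  afterwards picks up changes at the original source.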

** Experimentation*

We love git for code development because it allows easy experimentation:  
create a branch, throw new functionality against the wall, see if it sticks, 
and if it does - merge.  The same practice often applies to data analysis: we 
want to try different processing parameters, a new algorithm, etc.  Keeping 
incoming data, code, and output results under the same VCS establishes a clear 
track of how any specific result was obtained.  The beauty of git-annex (and 
DataLad ;-)) is that you still use Git while working with your data.  Some git 
features listed below become a "godsend" for experimentation:

** git checkout BRANCH

I guess I do not need to describe the utility of this functionality in git.  
But what is great when working with git-annex is that checkouts of alternative 
branches are SUPER FAST regardless of how big your data files (under git-annex 
control) are, because they are just symlinks.  You can also literally re-layout 
the entire tree within seconds, if the data files are annotated with metadata 
in git-annex.  If you would like to see/try it, just do

    git clone http://datasets.datalad.org/labs/openneurolab/metasearch/.git
    cd metasearch; ls # or tree
    git annex view "species=*" "sex=*" "handedness=*"
    ls # or tree
    git checkout master  # to appreciate the speed
    git checkout -

So you can "explore" and manipulate the dataset even without fetching any data 
yet (use  datalad get  or  git annex get  to fetch the files of interest).

** git reset --hard SOMEWHEREINTHEPAST  is great!

So many times I do something, possibly right in master, and then want to get 
rid of it or rerun it.  git reset --hard is my friend, and it works wonderfully 
with git-annexed files -- super fast, etc.  git clean -dfx  helps to keep 
everything tidy.
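In a plain toy repository the cycle looks like this; the same commands behave 
identically, just much faster relative to file size, on large annexed files 
(the repository and file names below are made up for illustration):

```shell
# set up a throwaway repository
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com && git config user.name "You"

echo "v1" > result.txt
git add result.txt && git commit -q -m "first analysis output"

echo "v2-experimental" > result.txt    # an edit we now regret
echo "scratch" > tmp.dat               # untracked clutter from a rerun

git reset --hard -q HEAD               # back to the last committed state
git clean -dfxq                        # drop everything untracked/ignored
cat result.txt                         # "v1" again; tmp.dat is gone
```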

** datalad run (mentioned above)  and  datalad rerun  commands

I use  datalad run  more and more now, whenever any outputs are produced by 
running a command.  It makes it so easy to create a clear record in the git 
history identifying which command led to that change.  datalad rerun  can then 
be used to reconstruct the entire history (support for merges is work in 
progress), or to rerun commands on top of the current point in git, in case my 
tool changed or I am in a different environment.
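A minimal sketch (the file patterns and the script are hypothetical; --input 
and --output tell DataLad which content to fetch and unlock before running):

```shell
datalad run -m "smooth all images" \
    --input 'raw/*.nii.gz' --output 'derived' \
    ./code/smooth.sh raw derived
# later, possibly with an updated tool or on another machine:
datalad rerun        # re-execute the command recorded in the last commit
```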

Related is the https://github.com/datalad/datalad-container extension.  The 
idea is that if reproducibility is the goal, we should keep entire computing 
environments (singularity or docker images) under VCS as well!  And since 
git-annex does not care what kind of file you keep in it, everything works 
smoothly.  Now we can be sure that we use the same environment locally and on 
HPC, all changes are recorded in git history, and we have clean ways to 
transfer images between our computing infrastructure.
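Roughly, with the datalad-container extension (the container name and image 
URL below are hypothetical):

```shell
# fetch an image and put it under git-annex control in the dataset
datalad containers-add analysis-env --url shub://example/analysis-env
# run the command inside that tracked image; both the command and the
# exact container version are recorded in git history
datalad containers-run -n analysis-env ./code/analysis.sh
```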
** Integrations/Collaborations*

The power and the curse (somewhat) of git-annex is its breadth of coverage of 
external storage solutions.  You can manage data content spread across a 
variety of "remotes" -- regular ssh-reachable hosts, S3 buckets, Google Drive 
(see https://github.com/DanielDent/git-annex-remote-rclone), etc.  And you can 
provide custom additions, like we did for accessing data provided online in 
tar/zip-balls by many portals.  Literally any available online dataset can be 
made accessible via git-annex, and it can support any available online/cloud 
storage portal.  The main beauty is that the repository remains a Git 
repository, so you can publish it on github.  Try e.g.

    datalad install -g https://github.com/psychoinformatics-de/studyforrest-data-phase2

to get yourself all 14.5GB of that dataset hosted on github, with data flowing 
from some regular http servers.  Another example is the OpenNeuro project, 
which is switching to DataLad as its data management backend, where data is 
offloaded to a versioned S3 bucket (s3://openneuro.org) while the git repos are 
shared on github (https://github.com/OpenNeuroDatasets; still work in progress, 
so some rough corners remain to polish).

** figshare - "publish" your datasets using datalad

http://docs.datalad.org/en/latest/generated/man/datalad-export-to-figshare.html

So you can publish your dataset as a tarball to figshare, and your local 
content will then be linked to that tarball (so you can publish your git repo 
to github etc).  Figshare (as well as zenodo) is suboptimal for data which is 
actively changing, since a published dataset there cannot be changed.  They 
also do not support directory structure, which is why we publish tarballs.  Our 
export to figshare could be improved to provide a flattened collection of files 
instead of a tarball, though (contributions are welcome).

** OSF - someone needs to find the time to add support for it, I guess

** Internal - students use DataLad to obtain/transfer/sync data between the 
incoming data server, the lab cluster, and the institutional HPC cluster.

The beauty of git-annex is that it allows keeping the entire git repository on
HPC pretty much indefinitely while just dropping the data content, so as not
to consume precious space on HPC, while still being able to get it back there
in case they need to rerun an analysis.  All the changes are strictly
version controlled, so they never lose track of "which version of the
data I need" or "on which version of the data I ran the analysis".
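The cycle on the HPC side looks roughly like this (the URL and paths are 
hypothetical placeholders):

```shell
datalad install -s ssh://incoming.example.edu/data/study1 study1
cd study1
datalad get data/       # fetch only the content actually needed
# ... run the analysis ...
datalad drop data/      # free space; git history and availability info remain
datalad get data/       # months later: the same versioned content comes back
```

Note that  datalad drop  by default verifies that another copy of the content 
exists elsewhere before removing it locally.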

** https://web.gin.g-node.org/

Have a look at this "github clone" which was extended with git-annex support, 
in case you want a git-annex aware github instance of your own.

** Caching - only recently exercised as an opportunity.  See 
https://github.com/datalad/datalad/issues/2763 and references therein.

This also relates to experimentation.  It is very quick to clone a dataset 
locally or from the web, but then you might be duplicating data which is 
already available on the filesystem.  With such a "caching remote" it would be 
possible to take advantage of hardlinks/CoW filesystems and experiment with 
datasets as quickly as experimenting with code, without fear of ruining some 
original centralized version of the dataset.
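The mechanism such a caching remote can exploit is illustrated below with a 
plain hardlink: the second name appears instantly and shares the same on-disk 
content (same inode), so nothing is duplicated.  CoW reflinks work analogously 
on filesystems like btrfs.  (File names are made up; stat -c is GNU/Linux 
syntax.)

```shell
dir=$(mktemp -d) && cd "$dir"
dd if=/dev/zero of=original.dat bs=1024 count=64 2>/dev/null  # a 64K "data file"
ln original.dat cached.dat                # instant second name, zero extra space
stat -c '%i %s' original.dat cached.dat   # same inode (and size) for both names
```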

** Modularity (full study history/objects)*

** standard git submodules - similar to per-file (get/drop), you can 
install/uninstall subdatasets

In DataLad, after trying a few other ideas, we decided to just use the standard 
git submodule mechanism for "composition" of subdatasets into something bigger 
(we call them super-datasets):

- as pointed out in the comments above, the entire http://datasets.datalad.org 
is just a git repository with submodules, establishing the full hierarchy of 
datasets with clear versioned associations
- if you have a study-dedicated VCS, you can then install (as submodules) your 
input datasets (from other places), provide new sub-datasets for derived data 
(e.g. preprocessed) and results, and possibly reuse those as independent pieces 
in follow-up studies.  Everything is version controlled, with clear versioning 
associations, etc
- include as a submodule a dataset with your favorite singularity/docker 
images! ;)
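A bare-git sketch of such a composition (all paths below are 
temporary/hypothetical; DataLad wraps exactly this mechanism behind  datalad 
create  and  datalad install):

```shell
work=$(mktemp -d) && cd "$work"

# pretend this is a published input dataset
git init -q input-data
git -C input-data -c user.email=a@example.com -c user.name=a \
    commit -q --allow-empty -m "raw data, version 1"

# the study-dedicated repository, composing the input as a submodule
git init -q study && cd study
git -c protocol.file.allow=always submodule --quiet add "$work/input-data" inputs/raw
git -c user.email=a@example.com -c user.name=a commit -q -m "pin input dataset version"

git submodule status    # shows the exact input commit this study is pinned to
```

(The protocol.file.allow override is only needed because recent git forbids 
local-path submodule clones by default.)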

** "Open standard" *

All git-annex'ed information lives within the git-annex branch and is easy to 
understand, in case someone would want to reimplement some git-annex 
functionality.

Sorry that this came out a bit too long, but hopefully some people will find 
it useful.
------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M015d6539e85b09244e739bff