Really nice to see DataLad being mentioned! Disclaimer: I am one of the DataLad founders/developers, so my opinions are obviously biased.
First of all, I want to say that the development of DataLad was largely inspired by two things: software distributions (and our experience with Debian and our NeuroDebian project), which make software tools readily available with clean versioning and through a unified interface, and version control systems, which we have used for all daily work for almost two decades now (Git since 2007; before that CVS, then SVN, and a little bzr). With DataLad we first of all wanted to establish a data distribution, but because we use and love Git, while working on DataLad we realized that we now had a full-blown data version control/management system, rather than just a "data distribution".

Let me go through some bullet points, which relate to other posts in the thread, in the hope that they are informative and help you make an informed decision on the choice of a version control system for data. This is by no means a complete presentation of git-annex and/or DataLad; I would refer you to our (possibly incomplete) documentation and examples at http://datalad.org . But whatever system you choose, my main message would be: *"PLEASE DO VERSION CONTROL YOUR DATA!"* Do not leave data a "2nd-class citizen" among the digital objects of your research/work.

* *Distributed data* - the main reason to choose git-annex was to be able to provide access to already publicly available data. When we started working on DataLad, there was no other general solution that made it possible to "hook" into existing data portals. We simply could not duplicate all data on our server for re-distribution, the way we do with software in e.g. Debian. ATM almost all data for the datasets provided from http://datasets.datalad.org comes from a wide variety of original locations (S3 buckets, http websites, custom data portals, ...) through the unified git-annex/DataLad interfaces.
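To give a concrete feel for that unified interface, here is a minimal sketch (assuming a working DataLad installation with network access) of installing the top-level superdataset from that portal:

```shell
# Sketch, assuming DataLad is installed and you are online.
# Install the top-level superdataset; this fetches only the lightweight
# Git repositories, no actual data content yet.
datalad install http://datasets.datalad.org

# '///' is DataLad's built-in shortcut for the same location:
datalad install ///
```

From there you can browse the full hierarchy of datasets and fetch only the content you actually need.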
The datalad crawl command can be used to quickly establish a git/git-annex repository from a standard S3 bucket or a simple website, or you can provide custom crawling pipelines (https://github.com/datalad/datalad-crawler). The majority of the datasets on http://datasets.datalad.org are created and updated via the datalad crawl command, and later published to that website over ssh using datalad publish. So you could get yourself a similar "data portal" within minutes if you have something to share.

* *Experimentation* - we often love Git for code development since it allows us to experiment easily: create a branch, throw new functionality against the wall, see if it sticks, and if it does, merge. The same practice often applies to data analyses: we want to try different processing parameters, a new algorithm, etc. Keeping incoming data, code, and output results under the same VCS establishes a clear track of how any specific result was obtained. The beauty of git-annex (and DataLad ;-)) is that you still use Git while working with your data. Some Git features listed below become a godsend for experimentation:

** git checkout BRANCH - I guess I should not need to describe the utility of this functionality in Git. But what is great when working with git-annex is that checkouts of alternative branches are SUPER FAST regardless of how big your data files (under git-annex control) are, because they are just symlinks. You can also literally re-layout the entire tree within seconds, if data files are annotated with metadata in git-annex. If you would like to see/try it, just do

    git clone http://datasets.datalad.org/labs/openneurolab/metasearch/.git
    cd metasearch; ls   # or tree
    git annex view species=* sex=* handedness=*
    ls                  # or tree
    git checkout master # to appreciate the speed
    git checkout -

So you can "explore" and manipulate the dataset even without fetching any data yet (use datalad get or git annex get to fetch the files of interest).
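Fetching and releasing actual file content is equally simple; a minimal sketch, where the file path is purely hypothetical (substitute any annexed file from the dataset you cloned):

```shell
# Sketch, run inside a cloned git-annex/DataLad dataset.
# 'some/file.dat' is a hypothetical path -- pick a real annexed file.
datalad get some/file.dat    # fetch the content from wherever it lives
datalad drop some/file.dat   # free the local space again; the symlink
                             # stays, and you can always 'get' it back
```

This get/drop cycle is what makes it cheap to keep many large datasets "installed" while holding only the content you are currently working on.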
** git reset --hard SOMEWHEREINTHEPAST is great! So many times I do something, possibly still in master, and then want to get rid of it, or rerun it. git reset --hard is my friend, and it works wonderfully with git-annexed files -- super fast, etc. git clean -dfx helps to keep everything tidy.

** datalad run (mentioned above) and datalad rerun - I use datalad run more and more now, whenever a command I run produces any outputs. It makes it easy to leave a clear record in the Git history identifying which command led to which change. datalad rerun can then be used to reconstruct the entire history (merge support is WiP), or to rerun commands on top of the current point in Git, in case my tool changed or I am in a different environment. Related is the https://github.com/datalad/datalad-container extension. The idea is that if reproducibility is the goal, we should keep entire computing environments (Singularity or Docker images) under VCS as well! And since git-annex does not care what kind of file you keep in it, everything works smoothly. Now we can be sure that we use the same environment locally and on HPC, all changes are recorded in the Git history, and we have clean ways to transfer environments between our computing infrastructure.

* *Integrations/Collaborations* - the power and the curse (somewhat) of git-annex is its breadth of coverage of external storage solutions. You can manage data content spread across a variety of "remotes" -- regular ssh-reachable hosts, S3 buckets, Google Drive (see https://github.com/DanielDent/git-annex-remote-rclone), etc. And you can provide custom additions, like we did for accessing data provided online in tar/zip-balls by many portals. Literally any available online dataset can be made accessible via git-annex, and any available online/cloud storage portal can be supported. The main beauty is that the repository remains a Git repository, so you can publish it on GitHub. Try e.g.
    datalad install -g https://github.com/psychoinformatics-de/studyforrest-data-phase2

to get yourself all 14.5GB of that dataset, hosted on GitHub with data flowing from some regular http servers. Another example is the OpenNeuro project, which is switching to DataLad as its data management backend, with data offloaded to a versioned S3 bucket (s3://openneuro.org) while the Git repos are shared on GitHub (https://github.com/OpenNeuroDatasets; still WiP, so some rough corners remain to polish).

** figshare - "publish" your datasets using DataLad: http://docs.datalad.org/en/latest/generated/man/datalad-export-to-figshare.html . You can publish your dataset as a tarball to figshare, and your local content will then be linked to that tarball (so you can still publish your Git repo to GitHub etc.). figshare (as well as Zenodo) is suboptimal for actively changing data, since a published dataset there cannot be changed. They also do not support directory structure, which is why we publish tarballs. Our figshare export could be improved to provide a flattened collection of files instead of a tarball (contributions are welcome).

** OSF - someone needs to find the time to provide support for it, I guess.

** Internal - students use DataLad to obtain/transfer/sync data between an incoming data server, the lab cluster, and the institutional HPC cluster. The beauty of git-annex is that it allows keeping the entire Git repository on HPC pretty much indefinitely while just dropping the data content, so as not to consume precious space there, while being able to get it again should they need to rerun an analysis. All changes are strongly version controlled, so they never lose track of "which version of the data I need" or "which version of the data I ran the analysis on".

** https://web.gin.g-node.org/ - have a look at this "github clone" which was extended with git-annex support, in case you want a git-annex-aware GitHub instance of your own.

** Caching - only recently explored as an opportunity.
See https://github.com/datalad/datalad/issues/2763 and references therein. This also relates to experimentation. It is very quick to clone a dataset locally or from the web, but you might then be duplicating data that is already available on the filesystem. With such a "caching remote" it would be possible to take advantage of hardlinks/CoW filesystems and experiment with datasets as quickly as with code, without fear of ruining some centralized original version of the dataset.

* *Modularity (full study history/objects)*

** standard git submodules - similarly to per-file get/drop, you can install/uninstall subdatasets. In DataLad, after trying a few other ideas, we decided to just use the standard Git submodule mechanism for "composition" of subdatasets into something bigger (we call them super-datasets):
- as pointed out in the comments above, the entire http://datasets.datalad.org is just a Git repository with submodules, establishing the full hierarchy of datasets with clear versioned associations
- if you have a study-dedicated VCS, you can then install (as submodules) your input datasets (from other places), provide new sub-datasets for derived data (e.g. preprocessed) and results, and possibly reuse those as independent pieces in follow-up studies. Everything is version controlled, with clear versioning associations, etc.
- include as a submodule a dataset with your favorite Singularity/Docker images! ;)

** "Open standard" - all git-annex'ed information is within the git-annex branch, and easy to understand in case someone wants to reimplement some git-annex functionality.

Sorry that it came out a bit too long, but hopefully some people will find it useful.
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M015d6539e85b09244e739bff
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
