I've used datalad (as it's a fairly thin wrapper around git-annex), though I haven't really pushed the reproducibility parts of it (my main use for it is connecting different repositories together when pushing). For me, datalad run $script is just a convenience over running the script and then adding the output file (though I can see that if you were creating a large number of files in each run, datalad run would be a major improvement). The unique feature of datalad seems to be the ease with which subdatasets can be managed (e.g. http://datasets.datalad.org/?dir=/openfmri is a single dataset with quite a number of subdatasets), which means it's likely something I'm going to use more often.
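For anyone who hasn't tried it, a datalad run invocation looks roughly like the following (a sketch from memory, so check the docs; the script and file names here are made up):

```shell
# Create a dataset, then capture a script's output in a single commit.
# datalad records the command in the commit so it can be replayed later.
datalad create mydataset
cd mydataset
datalad run -m "Generate results" \
    --output results.csv \
    python prepare.py

# A recorded command can be replayed on a given commit with:
#   datalad rerun <commit>
```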
James

On 3 August 2018 at 04:33, Rémi Rampin <[email protected]> wrote:
> 2018-08-02 10:54 EDT, <[email protected]>:
>>
>> Since this thread was highlighted in yesterday's Carpentry Clippings, I'll
>> bet I'm not the last to jump in today, so I'll be brief.
>>
>> DVC was mentioned at the beginning, but I gather few here have given it a
>> try. I encourage you to take a look. The tool is still in alpha, but
>> developing quickly, with a lot of potential. What I like about DVC:
>>
>> - Works in parallel to git and is similar to Git-LFS in
>>   cloning/pushing/pulling references to data files
>> - Data files are not tracked by git; your code repository remains just that
>> - Supports external data sources (since 0.10.0); do you really want a copy
>>   of your data *within* every repo that reads it?
>> - Supports multiple cloud data sources (e.g. Amazon S3)
>> - Does not default to "publishing" data on GitHub. GitHub is no Dataverse
>>   or Figshare (... data discoverability, yada yada)
>> - It's a makefile alternative too!
>
> DVC looks nice and easy to grasp, and is not that far from Git-LFS. Being
> able to use S3 or whatever else for storage is huge, because there are very
> few options for LFS servers (the only open-source option I know of is built
> into GitLab). Adding new backends looks very straightforward (easier than
> patching git-annex).
>
> Its tracking of workflows might or might not matter to you. It's nice to
> have, and definitely useful if your project happens to be a data-science
> kind of workflow, but if you just need to share data files you won't use
> it. It won't get in the way, though.
>
> Likewise, the absence of integration with Git might or might not be a good
> thing. It is nice to be able to see changes to CSVs right from git-diff
> when using LFS. If your files are not diffable, you won't miss it. Git
> operations are certainly faster without this machinery.
>
> I personally like that the pointer files (.dvc) have a different filename
> than the data files. This causes me constant headaches when using Git-LFS
> (do I need to "lfs checkout"? Do I need to "git reset" the data out of my
> Git index?).
>
> DVC seems close to Datalad, which has been mentioned once in this thread.
> Has anyone here used that in practice? It seems to be a more complex
> option, though it might be more powerful. It seems more deeply integrated
> with Git, in that runs will create commits and branches directly, rather
> than just updating .dvc files that it is your responsibility to check in.
>
> Best
> --
> Rémi

--
Don't send me files in proprietary formats (.doc(x), .xls, .ppt etc.). It
isn't good enough for Tim Berners-Lee, and it isn't good enough for me
either. For more information, visit
http://www.gnu.org/philosophy/no-word-attachments.html.

Truly great madness cannot be achieved without significant intelligence.
- Henrik Tikkanen

If you're not messing with your sanity, you're not having fun.
- James Tocknell

In theory, there is no difference between theory and practice; in practice,
there is.

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc391b14e70952e72cff01775
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
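For reference, the DVC workflow described in the quoted message can be sketched roughly as follows (based on DVC's documented commands; the file, remote, and bucket names are hypothetical):

```shell
# Track a data file with DVC: the data itself stays out of git,
# and git tracks only the small .dvc pointer file.
dvc init
dvc add data/raw.csv
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw data with DVC"

# Push the actual data to a cloud remote (here S3; bucket name made up),
# then collaborators fetch it with 'dvc pull' after cloning.
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```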
