2018-08-02 10:54 EDT, <[email protected]>:
> Since this thread was highlighted in yesterday's Carpentry Clippings, I'll
> bet I'm not the last to jump in today, so I'll be brief.
>
> DVC <http://github.com/iterative/dvc.git> was mentioned at the beginning,
> but I gather few here have given it a try. I encourage you to take a look.
> The tool is still in alpha, but developing quickly with a lot of potential.
> What I like about DVC:
>
>    - Works in parallel to git and is similar to git LFS in
>    cloning/pushing/pulling references to data files
>    - Data files are not tracked by git; your code repository remains just
>    that
>    - Supports external data sources (since 0.10.0
>    <https://github.com/iterative/dvc/releases/tag/0.10.0>); do you really
>    want a copy of your data *within* every repo that reads it?
>    - Supports multiple cloud data sources (e.g. Amazon S3)
>    - Does not default to "publishing" data on GitHub. GitHub is no
>    Dataverse or Figshare (... data discoverability, yada yada)
>    - It's a makefile alternative too!
>

DVC looks nice and easy to grasp, and is not that far from Git-LFS. Being able to use S3 or whatever else for storage is huge, because there are very few options for LFS servers (the only open-source option I know of is the one built into GitLab). Adding new backends looks very straightforward (easier than patching git-annex).
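
For anyone unfamiliar with the "references to data files" idea that DVC and Git-LFS share, here is a rough Python sketch of it. This is only an illustration under my own assumptions, not DVC's or Git-LFS's actual on-disk format: the `.pointer` suffix, the JSON layout, and the function names `add_to_cache` and `checkout` are all made up for the example.

```python
import hashlib
import json
import shutil
from pathlib import Path

# Hypothetical suffix for this sketch; real tools use their own conventions.
POINTER_SUFFIX = ".pointer"


def add_to_cache(data_path: Path, cache_dir: Path) -> Path:
    """Store a copy of the data file under its hash and write a small pointer file.

    Only the pointer file would be committed to version control; the data
    itself lives in cache_dir (which could be synced to remote storage).
    """
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, cache_dir / digest)

    # The pointer gets a different filename than the data file it describes.
    pointer_path = data_path.with_name(data_path.name + POINTER_SUFFIX)
    pointer_path.write_text(json.dumps({"sha256": digest, "path": data_path.name}))
    return pointer_path


def checkout(pointer_path: Path, cache_dir: Path) -> Path:
    """Restore the data file named in the pointer from the cache."""
    pointer = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(pointer["path"])
    shutil.copyfile(cache_dir / pointer["sha256"], target)
    return target
```

Calling `add_to_cache(Path("measurements.csv"), Path(".cache"))` would leave a `measurements.csv.pointer` file to commit, and `checkout` on that pointer brings the data back after a fresh clone, provided the cache is available.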
DVC's workflow tracking might or might not matter to you. It's nice to have, and definitely useful if your project follows a data-science kind of workflow, but if you just need to share data files you won't use it, and it won't get in the way.

Likewise, the absence of integration with Git might or might not be a good thing. It is nice to be able to see changes to CSVs right from git-diff when using LFS. If your files are not diffable, you won't miss it. Git operations are certainly faster without this machinery.

I personally like that the pointer files (.dvc) have a different filename than the data files. Having both under the same name causes me constant headaches when using Git-LFS (do I need to "lfs checkout"? Do I need to "git reset" the data out of my Git index?).

DVC seems close to DataLad, which has been mentioned once in this thread. Has anyone here used that in practice? It seems to be a more complex option, though it might be more powerful. It seems more deeply integrated with Git, in that runs will create commits and branches directly, rather than just updating .dvc files that you are responsible for checking in.

Best
-- 
Rémi

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M096cd2663242ccc1a93693ca
