Good question; I'd also be interested in comments on this. I'd second Dav's point that it depends on file size, and certainly for files < 100 MB, simply committing them to git seems like the most reasonable way to go.
Workflow-wise, I find Git LFS very compelling, but in practice I found it not to be viable for public GitHub projects in which you expect forks and PRs. GitHub's pricing model basically means that Git LFS breaks the fork / PR workflow (see https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91). You can set up a different host (e.g. GitLab) for the LFS part and still have your repo on GitHub (see https://github.com/jimhester/test-glfs/), but this was sufficiently cumbersome that I could not get it to work.

I have not experimented with Git Annex or Dat, but my understanding is that while these provide a version control solution, they do not provide a file-storage solution. Dat is a peer-to-peer model, which I believe means you need some 'peer' server always on and running somewhere when you want to access your data.

My own need is almost the inverse of this problem: I am primarily looking for a mechanism to easily share data associated with a project that already lives on GitHub (possibly public, possibly private), and I want a way to give collaborators / students access to both download and upload the data without asking them to adopt a workflow of tools that is any more complicated than it needs to be. For example, sticking the data on Amazon S3 is often good enough (I can version data linearly with file names; I do not need git merge capabilities), but this does impose a significant overhead on new users, who need to use the aws cli or similar and set up more authentication tokens. It is a small barrier, but enough to discourage collaborators.

My recent approach has been to piggyback > 100 MB files directly on GitHub as 'assets', which can be up to 2 GB in size.
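
For what it's worth, the size-based rule of thumb in this message can be sketched in a few lines of Python. This is only an illustration and not how piggyback itself works: the names `storage_for` and `plan` are made up for the example, and the cutoffs are the 100 MB and 2 GB figures mentioned above, read as decimal units (an assumption on my part).

```python
from pathlib import Path

MB = 1000 ** 2                 # decimal megabyte (assumed unit)
GIT_LIMIT = 100 * MB           # below this, just commit the file to git
ASSET_LIMIT = 2000 * MB        # 2 GB ceiling mentioned for GitHub 'assets'


def storage_for(size_bytes: int) -> str:
    """Suggest where a file of the given size might live."""
    if size_bytes < GIT_LIMIT:
        return "git"
    if size_bytes <= ASSET_LIMIT:
        return "github-asset"
    return "external"          # e.g. S3 or a data archive


def plan(root: str) -> dict:
    """Map each file under `root` (skipping .git) to a suggested storage."""
    suggestions = {}
    for path in sorted(Path(root).rglob("*")):
        rel = path.relative_to(root)
        if path.is_file() and ".git" not in rel.parts:
            suggestions[rel.as_posix()] = storage_for(path.stat().st_size)
    return suggestions
```

Calling `plan(".")` from the top of a repo returns a dict of relative paths and suggestions; anything labelled "external" is too big for either option and would need S3, a data archive, or similar.
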
Attaching files as assets is not a robust versioning solution (I believe that public, archival research data ought to be deposited in a *data archive* and versioned there) and may not be a good idea at all, but it can be remarkably convenient for certain use cases (like keeping your 100 MB to 2 GB spatial data shapefiles associated with the repo where you're analyzing them).

Not to subvert this thread, but if you're curious about this approach using R, I have a little package to facilitate this workflow: https://github.com/cboettig/piggyback ; feedback/critique welcome.

Cheers,

Carl

On Fri, Jul 20, 2018 at 2:48 PM thompson.m.j via discuss <[email protected]> wrote:

> Hello all,
> I am a member of a computational biology lab that models processes in
> developmental biology and cell signaling and calibrates these models with
> microscopy data. I've recently gotten into using version control using git
> for our codes, and I am now trying to determine the best course of action
> to take for the data. These are the tools I'm aware of but have not tested:
>
> The Dat Project https://datproject.org/
> Git Large File Storage https://git-lfs.github.com/
> Git Annex https://git-annex.branchable.com/
> Data Version Control (DVC) https://dvc.org/
>
> All projects seem to be aimed at researchers trying to integrate data
> versioning into their workflow and collaboration, and some seem to have a
> few other bells and whistles.
>
> Now, the only reason I settled on using git for my work is that it seems
> to be the de facto standard version control just about the whole world
> uses. Using this same reasoning, does anyone here have a keen insight into
> which of the data versioning tools listed here or otherwise is (or will
> most likely become) the standard for data version control?
> *The Carpentries <https://carpentries.topicbox.com/latest>* / discuss
> Permalink <https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M26854e6b9b3500ea27de1bc9>

--
http://carlboettiger.info

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Me96e4d3cbc9ff7c08c4d2d76
