Good question; I'd also be interested in comments on this.

I'd second Dav's comment that it depends on file size.  Certainly for files
under 100 MB, simply committing them to git seems like the most reasonable
way to go.
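
To make that cutoff concrete, here is a minimal Python sketch that lists
files above a size limit before you commit.  The helper name
`find_large_files` and the default limit are assumptions for illustration
only, not part of git or any other tool.

```python
import os

MB = 1024 * 1024


def find_large_files(root, limit_bytes=100 * MB):
    """Return sorted (path, size) pairs for files under root larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune git's own object store so it is not reported.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # Skip symlinks; a dangling link would make getsize() fail.
                continue
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    return sorted(large)
```

Calling `find_large_files(".")` from the top of a working copy returns the
files that would need some other home; an empty list means everything is
under the limit.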

Workflow-wise, I find Git LFS very compelling, but in practice I found it
not to be viable for public GitHub projects in which you expect forks and
PRs.  GitHub's pricing model basically means that Git LFS breaks the fork /
PR workflow (see
https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91).
You can set up a different host (e.g. GitLab) for the LFS part and still
have your repo on GitHub (see https://github.com/jimhester/test-glfs/),
but this was sufficiently cumbersome that I could not get it to work.

I have not experimented with Git Annex or Dat, but my understanding is that
while these provide a version control solution, they do not provide a
file-storage solution.  Dat is a peer-to-peer model, which I believe means
you need some 'peer' server always on and running somewhere when you want
to access your data.  My own need is almost the inverse of this problem --
I am primarily looking for a mechanism to easily share data associated with
a project that already lives on GitHub (possibly public, possibly private),
and I want a way to give collaborators / students access to both download
and upload the data without asking them to adopt a workflow of tools that
is any more complicated than it needs to be.  For example, sticking the
data on Amazon S3 is often good enough -- I can version data linearly with
file names, and I do not need git merge capabilities -- but this does
impose overhead on new users, who need to install the AWS CLI or similar
and set up more authentication tokens.  It is a small barrier, but enough
to discourage collaborators.
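
Versioning linearly with file names can be as simple as the sketch below.
The `stem_vNNN.ext` pattern and the helper `next_version_name` are
hypothetical, one convention among many; the returned name is just the key
you would upload under, on S3 or anywhere else.

```python
import re


def next_version_name(existing, stem, ext):
    """Return the next name in the sequence stem_v001.ext, stem_v002.ext, ..."""
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(ext)}$")
    versions = []
    for name in existing:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    # Start at 1 when no earlier version exists; zero-pad so names sort in order.
    return f"{stem}_v{max(versions, default=0) + 1:03d}{ext}"
```

Given the names already in the bucket, this yields the name for the next
upload.  Old versions are never overwritten, which is all the "version
control" this use case needs.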

My recent approach has been to piggyback files larger than 100 MB directly
on GitHub as release 'assets', which can be up to 2 GB each.  This is not a robust
versioning solution (I believe that public, archival research data ought to
be deposited in a *data archive* and versioned there), and may not be a
good idea at all, but can be remarkably convenient for certain use cases
(like keeping your 100 MB to 2 GB spatial data shape files associated with the
repo where you're analyzing them).  Not to subvert this thread, but if
you're curious about this approach using R, I have a little package to
facilitate this workflow: https://github.com/cboettig/piggyback ;
feedback/critique welcome.
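
For readers not using R, a file attached to a release on a public repo can
be fetched with a plain HTTPS request.  This is a minimal Python sketch,
assuming the usual `releases/download/<tag>/<file>` URL layout; the function
names are made up for illustration, and private repos would need an
authenticated API request, which is left out here.

```python
import shutil
import urllib.request
from urllib.parse import quote


def release_asset_url(owner, repo, tag, filename):
    """Build the download URL for a file attached to a GitHub release."""
    return (
        f"https://github.com/{quote(owner, safe='')}/{quote(repo, safe='')}"
        f"/releases/download/{quote(tag, safe='')}/{quote(filename, safe='')}"
    )


def download_release_asset(owner, repo, tag, filename, dest=None):
    """Stream the asset to disk (public repos only) and return the local path."""
    dest = dest or filename
    url = release_asset_url(owner, repo, tag, filename)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so a multi-hundred-MB file is never held in memory.
        shutil.copyfileobj(response, out)
    return dest
```

Usage would look like `download_release_asset("some-user", "some-repo",
"v0.0.1", "shapes.zip")`, where all four arguments are placeholders rather
than a real dataset.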

Cheers,

Carl


On Fri, Jul 20, 2018 at 2:48 PM thompson.m.j via discuss <
[email protected]> wrote:

> Hello all,
> I am a member of a computational biology lab that models processes in
> developmental biology and cell signaling and calibrates these models with
> microscopy data. I've recently gotten into using version control using git
> for our codes, and I am now trying to determine the best course of action
> to take for the data. These are the tools I'm aware of but have not tested:
>
> The Dat Project https://datproject.org/
> Git Large File Storage https://git-lfs.github.com/
> Git Annex https://git-annex.branchable.com/
> Data Version Control (DVC) https://dvc.org/
>
> All projects seem to be aimed at researchers trying to integrate data
> versioning into their workflow and collaboration, and some seem to have a
> few other bells and whistles.
>
> Now, the only reason I settled on using git for my work is that it seems
> to be the de facto standard version control just about the whole world
> uses. Using this same reasoning, does anyone here have a keen insight into
> which of the data versioning tools listed here or otherwise is (or will
> most likely become) the standard for data version control?
>
-- 

http://carlboettiger.info

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Me96e4d3cbc9ff7c08c4d2d76
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
