I highly doubt that any of them will become *the* standard, though you may
see some convergence like we saw with NetCDF4 and HDF5. For now, the most
robust solution is probably Git LFS. It's backed by GitHub, and many
commercial providers are competing to provide performant back ends. In
my work at Gigantum (which I'll talk about sometime soon!), we evaluated
what makes sense for most users, and Git LFS was the clear answer. The data
synchronization model is simple, and there are few choices to make beyond
the level of the repo. By default, all the files still come along with the
repo, though you can configure things so you only keep local copies of some
files.
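In case it helps to see it concretely, here's a minimal sketch of a Git LFS setup (this assumes git-lfs is installed; the file patterns are just examples, not anything specific to your data):

```shell
# One-time setup: install the LFS hooks into your git config
git lfs install

# Track large files by pattern; this records the patterns in .gitattributes
git lfs track "*.tif"
git add .gitattributes

# From here on, matching files are committed as small pointer files,
# and the real content is uploaded to the LFS server on push
git add data/
git commit -m "Add imaging data via LFS"
```

If I remember right, you can also limit which content gets downloaded, e.g. with the `--include` / `--exclude` options to `git lfs fetch` or `git lfs pull` - that's the "only have copies of some files" part.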

Git Annex is the closest to the "right" solution for a "traditional"
workflow, IMHO. Joey Hess (who was a core Debian member for a long time)
developed it, and it's written in Haskell, so the compiler is working with
Joey - who is already very smart and thoughtful. The plugin architecture,
however, means your mileage may vary, and in my experience, running it on
Windows is not to be taken lightly. On POSIX-y filesystems, Git Annex
allows more flexibility in terms of which files are stored where. Git
Annex was chosen by the DataLad project in neuroimaging, and Joey is an
advisor to that project: http://www.datalad.org/
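To give a flavor of that per-file flexibility, here's a rough annex workflow (assuming git-annex is installed; the filename is made up):

```shell
# Inside an existing git repo, turn on the annex for this clone
git annex init "my-laptop"

# Annexed files are checked in as symlinks; the content is tracked separately
git annex add data/embryo_stack.tif
git commit -m "Add imaging stack to the annex"

# Then you pull or drop the actual content per file, per machine
git annex get data/embryo_stack.tif   # fetch the content here
git annex drop data/embryo_stack.tif  # free the space (annex verifies another copy exists first)
```

That get/drop dance is what I mean by "which files are where" - each clone can hold a different subset of the content while sharing the same history.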

Dat is a whole 'nother level. Its synchronization layer is exciting, but
again probably a bit sharp-edged for an academic lab that just started
using git.

I know less about DVC. You might also throw Quilt in there (
https://quiltdata.com/) - my sense is that they are trying to get closer
to the experience of R's built-in datasets, where you load and reuse the
same datasets again and again.

But to sharpen the question - it probably depends on the relationship of
the data to the code (one-to-one, one-to-many, etc.), and also the size (if
files are < 100MB, you can just put them directly in regular git on
GitHub!). To a lesser extent, it also depends on the infrastructure you're
using (laptops? a shared server? a network file share?), data use
restrictions / privacy, etc.

I for one would be happy to read your reasoning "out loud" here.

Best,
Dav

On Fri, Jul 20, 2018 at 5:48 PM thompson.m.j via discuss <
[email protected]> wrote:

> Hello all,
> I am a member of a computational biology lab that models processes in
> developmental biology and cell signaling and calibrates these models with
> microscopy data. I've recently gotten into using version control using git
> for our codes, and I am now trying to determine the best course of action
> to take for the data. These are the tools I'm aware of but have not tested:
>
> The Dat Project https://datproject.org/
> Git Large File Storage https://git-lfs.github.com/
> Git Annex https://git-annex.branchable.com/
> Data Version Control (DVC) https://dvc.org/
>
> All projects seem to be aimed at researchers trying to integrate data
> versioning into their workflow and collaboration, and some seem to have a
> few other bells and whistles.
>
> Now, the only reason I settled on using git for my work is that it seems
> to be the de facto standard version control just about the whole world
> uses. Using this same reasoning, does anyone here have a keen insight into
> which of the data versioning tools listed here or otherwise is (or will
> most likely become) the standard for data version control?
>

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mbb70aabc93d6ea28e6776e97
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
