So this touches on something that Shawn Ross has been working on for MQ in 
terms of data repositories.


Dataverse (OSS) and Figshare are both good institutional data repositories. I 
prefer Dataverse for its metadata quality and open-source heritage; Figshare, 
being commercial, worries me as a long-term provider. I wouldn't use either as 
a living version-control repository, though.


OSF.io is excellent as a publishing archive, and is a great frontend to 
whatever storage you decide to use.


Git LFS is free on GitHub if you authenticate through education.github.com 
(though that isn't widely advertised). It is also supported on GitLab and 
Bitbucket.
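Once the LFS client is installed, the setup is a one-time `git lfs track` per 
file pattern, which writes a `.gitattributes` file like the one below (a 
sketch; the `*.tif` and `*.czi` patterns are placeholders for whatever formats 
your microscopy data actually uses):

```
# .gitattributes -- produced by commands such as `git lfs track "*.tif"`
*.tif filter=lfs diff=lfs merge=lfs -text
*.czi filter=lfs diff=lfs merge=lfs -text
```

Commit `.gitattributes` itself so collaborators' clones route those files 
through LFS automatically.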


The main question here, however, is: is your data binary or plain text? Git 
doesn't have many advantages for binary data, since it can't diff or merge 
binary files meaningfully.
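Git decides text-vs-binary with a simple heuristic you can approximate 
yourself. This is a minimal sketch, not git's actual implementation (which 
lives in its C source): the commonly cited rule is "a NUL byte in the first 
8000 bytes means binary". The file names and contents below are invented for 
the demo.

```python
import os
import tempfile

def looks_binary(path, window=8000):
    """Treat a file as binary if a NUL byte appears in its first
    `window` bytes -- a sketch of git's heuristic (8000 bytes is
    the window git is generally described as using)."""
    with open(path, "rb") as fh:
        return b"\x00" in fh.read(window)

# Demo with two throwaway files: a small CSV and some fake TIFF-like bytes.
with tempfile.TemporaryDirectory() as d:
    text_path = os.path.join(d, "table.csv")
    blob_path = os.path.join(d, "image.tif")
    with open(text_path, "w") as fh:
        fh.write("id,intensity\n1,0.5\n")
    with open(blob_path, "wb") as fh:
        fh.write(b"II*\x00" + b"\x00" * 16)  # TIFF magic bytes, then padding
    text_is_binary = looks_binary(text_path)
    blob_is_binary = looks_binary(blob_path)

print(text_is_binary)  # False
print(blob_is_binary)  # True
```

If most of your microscopy data fails this check, git alone will version it 
but give you none of its usual diff/merge leverage.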


It may be worth maintaining this data in a proper SQL database and putting 
plain-text dumps from the database under version control.
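As a low-friction illustration of that dump-and-version idea, here is a sketch 
using SQLite from the Python standard library (a server database such as 
PostgreSQL would use `pg_dump` instead; the `measurements` table here is 
invented for the example):

```python
import sqlite3

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, intensity REAL)"
)
conn.executemany(
    "INSERT INTO measurements (intensity) VALUES (?)", [(0.12,), (0.87,)]
)
conn.commit()

# iterdump() yields the SQL statements needed to recreate the database:
# line-oriented text that git can diff and merge, unlike the binary
# .db file itself.
dump = "\n".join(conn.iterdump())
print(dump)
```

Committing such a dump after each update gives you readable, line-by-line 
history of how the data changed, while the live database stays outside git.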


But, before we can explore more deeply, we need to characterise your data and 
how you plan to use version control with it.


Is it:


Relational?

Text?

What size?

Sparse (lots of nulls)?


And in terms of questions you'll be asking of the data:


Do you need only the present version (with prior versions kept for recovery), 
or do you need to analyse how the data changes over time?

What software tools will you be using with the data?

Will you be using "the cloud" or other HPC?

Will you be accessing the full dataset every time, or will you be doing lookups 
on subsets of the data?




________________________________
From: thompson.m.j via discuss <[email protected]>
Sent: Saturday, 21 July 2018 2:08:01 AM
To: discuss
Subject: [discuss] Version control and collaboration with large datasets.

Hello all,
I am a member of a computational biology lab that models processes in 
developmental biology and cell signaling and calibrates these models with 
microscopy data. I've recently started using git for version control of our 
code, and I am now trying to determine the best course of action to take for 
the data. These are the tools I'm aware of but have not tested:

The Dat Project
https://datproject.org/
Git Large File Storage
https://git-lfs.github.com/
Git Annex
https://git-annex.branchable.com/
Data Version Control (DVC)
https://dvc.org/

All projects seem to be aimed at researchers trying to integrate data 
versioning into their workflow and collaboration, and some seem to have a few 
other bells and whistles.

Now, the only reason I settled on git for my work is that it seems to be the 
de facto standard version control system that just about the whole world uses. 
By the same reasoning, does anyone here have keen insight into which of the 
data versioning tools listed here (or otherwise) is, or will most likely 
become, the standard for data version control?

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma7b92cfc00a5d9f102cfc2c2
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
