So this touches on something that Shawn Ross has been working on for MQ in terms of data repositories.
Dataverse (OSS) and Figshare are good institutional data repos. I prefer dataverse because of its quality of metadata and OSS heritage. Figshare, being commercial, worries me as a long term provider. I wouldn't use either as a living version control repo though. OSF.io is excellent as a publishing archive, and is a great frontend to whatever storage you decide to use. Git lfs is free on github if you authenticate with education.github.com (but that's not widely advertised). It is also supported on gitlab and bitbucket. The main question here, however, is: is your data binary or ASCII?. Git doesn't really have many advantages on binary data. It may be worth using a proper SQL database to maintain this data and to store data-dumps from the database. But, before we can explore more deeply, we need to characterise your data and how you plan to use version control with it. Is it: Relational? Text? What size? Sparse (lots of nulls?) And in terms of questions you'll be asking of the data: In present version with prior for recovery or how it changes over time? What software tools will you be using with the data? Will you be using "the cloud?" or other HPC? Will you be accessing the full dataset every time, or will you be doing lookups on subsets of the data? ________________________________ From: thompson.m.j via discuss <[email protected]> Sent: Saturday, 21 July 2018 2:08:01 AM To: discuss Subject: [discuss] Version control and collaboration with large datasets. Hello all, I am a member of a computational biology lab that models processes in developmental biology and cell signaling and calibrates these models with microscopy data. I've recently gotten into using version control using git for our codes, and I am now trying to determine the best course of action to take for the data. 
These are the tools I'm aware of but have not tested:

- The Dat Project: https://datproject.org/
- Git Large File Storage: https://git-lfs.github.com/
- Git Annex: https://git-annex.branchable.com/
- Data Version Control (DVC): https://dvc.org/

All of these projects seem to be aimed at researchers trying to integrate data versioning into their workflow and collaboration, and some seem to have a few other bells and whistles. Now, the only reason I settled on git for my work is that it seems to be the de facto version control standard that just about the whole world uses. By the same reasoning, does anyone here have a keen insight into which of the data versioning tools listed here, or otherwise, is (or will most likely become) the standard for data version control?

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma7b92cfc00a5d9f102cfc2c2
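To make the database-dump suggestion in the reply above concrete: here is a minimal sketch, using Python's built-in sqlite3 module and a hypothetical `measurements` table (the schema and file names are illustrative, not from the original thread). The idea is that the binary database file stays out of git, while a plain-SQL dump of it is committed, so git can diff and merge changes line by line.

```python
import sqlite3

# Build a tiny example database in memory (hypothetical schema,
# purely for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (cell_id INTEGER, intensity REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 0.42), (2, 0.87)],
)
conn.commit()

# Connection.iterdump() yields the whole database as SQL statements --
# plain text, so a "git diff" on the dump shows row-level changes,
# which it cannot do for the binary .db file.
dump = "\n".join(conn.iterdump())
with open("measurements.sql", "w") as f:
    f.write(dump)
```

You would then `git add measurements.sql` (and ignore the `.db` file itself), regenerating the database from the dump when needed. The same pattern works with `pg_dump` or `mysqldump` for server-based databases.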
