Since this directly applies: I saw this note in my RSS reader (NewsBlur) today:
https://blog.github.com/2018-07-30-git-lfs-2.5.0-now-available/


Specifically, their new "git lfs migrate import --fixup", which helps with that 
dreaded "You cannot push to (whatever) because your repository is too 
big." Also, since this isn't advertised anywhere: if you authenticate with 
education.github.com, you can get unlimited private repos and storage.
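For reference, the fixup workflow looks roughly like this (a sketch, assuming git-lfs >= 2.5.0 and a .gitattributes that already matches your large files; the branch and remote names are placeholders):

```shell
# Preview which file types take up the most space across history:
git lfs migrate info --everything

# Rewrite history on the current branch so that files already matched
# by .gitattributes become LFS pointers (this is what --fixup does):
git lfs migrate import --fixup

# History has been rewritten, so the remote has to be force-pushed:
git push --force-with-lease origin master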


(And, since it trips me up every time: after "installing" git-lfs from 
packagecloud on Linux, you still have to run "apt install git-lfs", since 
the packagecloud script only adds the repository.)
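Concretely, the Debian/Ubuntu sequence is roughly the following (the script URL is from packagecloud's standard install instructions; adjust for your distro):

```shell
# The packagecloud script only configures the apt repository and
# signing key -- it does not install the package itself:
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

# The package still has to be installed explicitly:
sudo apt install git-lfs

# One-time setup of the LFS smudge/clean filters in your git config:
git lfs install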

________________________________
From: Jon Pipitone <[email protected]>
Sent: Thursday, 26 July 2018 1:19:45 AM
To: discuss
Subject: Re: [discuss] Version control and collaboration with large datasets.

>My five cents is that it really depends on the characteristics of your
>data (e.g. size) and the goal you are trying to achieve by versioning
>your data.

+1 to thinking carefully about what your goals are here before jumping
to any particular tool.

My experience: I found myself re-organizing all my lab's neuroimaging
data, starting from data collected when the lab was a single grad student
up to when it was housing data from multiple studies and multiple sites
of data collection. We opted to begin by organizing the data with a
sensible naming scheme on a shared drive, as Lars describes, because it
was immediately accessible to everyone in the lab regardless of their
tech know-how, and was also a necessary starting point regardless of
whether we later adopted a fancier data versioning/sharing technology. We
did later use a neuroimaging-specific system for sharing our data with
others, but retained the filesystem organization as well because it was
familiar, and so darn convenient for scripting, documentation, etc.

Jon.

On 07/23/18, [email protected] wrote:
> Hi,
>
> My five cents is that it really depends on the characteristics of your data 
> (e.g. size) and the goal you are trying to achieve by versioning your data.
>
> Examples:
>
> Size: If the datasets are "small", they can easily be handled by Git. 
> For larger datasets, it depends on what is important to you. E.g. a shared 
> network file system with proper backup and a well-defined naming scheme can be 
> totally fine in some cases, while a proper data repository issuing DOIs or 
> similar is needed in other cases. If synchronization speed, as well as 
> optimized storage, is important, something like dat or IPFS is advisable.
>
> Purpose: Similarly, if your goal is to share data with collaborators, then a 
> simple HTTPS link is the easiest (hosted on e.g. GitHub, AWS, or a data 
> repository).
>
> Cheers,
> Lars

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M5489167c4c6220100f4abc5a
Delivery options: 
https://carpentries.topicbox.com/groups/discuss/subscription
