I've used datalad (as it's a fairly thin wrapper around git-annex), though I haven't really pushed the reproducibility parts of it (my main use for it is connecting different repositories together when pushing). For me, datalad run $script is just a convenience over running the script and then adding the output file (though I can see that if you were creating a large number of files in each run, datalad run would be a major improvement). The unique feature of datalad seems to be the ease with which subdatasets can be managed (e.g. http://datasets.datalad.org/?dir=/openfmri is a single dataset with quite a number of subdatasets), which means it's likely something I'm going to use more often.
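For anyone who hasn't tried it, a datalad run invocation looks roughly like the following (a sketch from memory, so check the docs; the script and file names here are made up):

```shell
# Create a dataset, then capture a script's output in a single commit.
# datalad records the command in the commit so it can be replayed later.
datalad create mydataset
cd mydataset
datalad run -m "Generate results" \
    --output results.csv \
    python prepare.py

# A recorded command can be replayed on a given commit with:
#   datalad rerun <commit>
```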
James

On 3 August 2018 at 04:33, Rémi Rampin <[email protected]> wrote:
> 2018-08-02 10:54 EDT, <[email protected]>:
>>
>> Since this thread was highlighted in yesterday's Carpentry Clippings, I'll
>> bet I'm not the last to jump in today, so I'll be brief.
>>
>> DVC was mentioned at the beginning, but I gather few here have given it a
>> try. I encourage you to take a look. The tool is still in alpha, but
>> developing quickly, with a lot of potential. What I like about DVC:
>>
>> - Works in parallel to git and is similar to Git-LFS in
>>   cloning/pushing/pulling references to data files
>> - Data files are not tracked by git; your code repository remains just that
>> - Supports external data sources (since 0.10.0); do you really want a copy
>>   of your data *within* every repo that reads it?
>> - Supports multiple cloud data sources (e.g. Amazon S3)
>> - Does not default to "publishing" data on GitHub. GitHub is no Dataverse
>>   or Figshare (... data discoverability, yada yada)
>> - It's a makefile alternative too!
>
> DVC looks nice and easy to grasp, and is not that far from Git-LFS. Being
> able to use S3 or whatever else for storage is huge, because there are very
> few options for LFS servers (the only open-source option I know of is built
> into GitLab). Adding new backends looks very straightforward (easier than
> patching git-annex).
>
> Its tracking of workflows might or might not matter to you. It's nice to
> have, and definitely useful if your project happens to be a data-science
> kind of workflow, but if you just need to share data files you won't use
> it. It won't get in the way, though.
>
> Likewise, the absence of integration with Git might or might not be a good
> thing. It is nice to be able to see changes to CSVs right from git-diff
> when using LFS. If your files are not diffable, you won't miss it. Git
> operations are certainly faster without this machinery.
>
> I personally like that the pointer files (.dvc) have a different filename
> than the data files. This causes me constant headaches when using Git-LFS
> (do I need to "lfs checkout"? Do I need to "git reset" the data out of my
> Git index?).
>
> DVC seems close to Datalad, which has been mentioned once in this thread.
> Has anyone here used that in practice? It seems to be a more complex
> option, though it might be more powerful. It seems more deeply integrated
> with Git, in that runs will create commits and branches directly, rather
> than just updating .dvc files that it is your responsibility to check in.
>
> Best
> --
> Rémi

--
Don't send me files in proprietary formats (.doc(x), .xls, .ppt etc.). It
isn't good enough for Tim Berners-Lee, and it isn't good enough for me
either. For more information, visit
http://www.gnu.org/philosophy/no-word-attachments.html.

Truly great madness cannot be achieved without significant intelligence.
- Henrik Tikkanen

If you're not messing with your sanity, you're not having fun.
- James Tocknell

In theory, there is no difference between theory and practice; in practice,
there is.

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc391b14e70952e72cff01775
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
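For reference, the DVC workflow described in the quoted message can be sketched roughly as follows (based on DVC's documented commands; the file, remote, and bucket names are hypothetical):

```shell
# Track a data file with DVC: the data itself stays out of git,
# and git tracks only the small .dvc pointer file.
dvc init
dvc add data/raw.csv
git add data/raw.csv.dvc .gitignore
git commit -m "Track raw data with DVC"

# Push the actual data to a cloud remote (here S3; bucket name made up),
# then collaborators fetch it with 'dvc pull' after cloning.
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```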
