Hi all,

I'm also very interested in learning about solutions for this.

At the moment I distinguish two use cases:

- the focus of the project is coding (developing software/a package/a
library), vs.

- the focus of the project is data analysis, with the sub-topic of
projects where various "slices" of the data are important.


**Code project**

I have one project where I use git-lfs on github. The project is about
*code* (an R package) that however has some 100 MB of binary data
attached to it (it was larger at some point, before I could get smaller
but equally suitable example files for some formats). The binary data
are example files in various file formats for the file import filters
the package provides. Initially we had them in git as well, but that
horribly bloated the repo, so it became unusable after a few years. The
files themselves, however, hardly need any versioning: I get them and
store them as they are, and only very occasionally is one of those
files replaced. The main point of the git-lfs storage is to make sure
that all files are where they are supposed to be without too much
manual hassle.
At some point I was lucky to get a github promo offer for free git-lfs
(test) usage and gave it a try - which is the current state.

Experiences:

- (thanks to the free promo, I don't have bandwidth billing trouble)

- Git is largely independent of git-lfs: you can still fork/clone the
git-only part of the repo and work with that. For the project in
question, the files stored in git-lfs are only needed for developing
and unit testing the file import filters; everything else does not need
git-lfs. I decided I don't want to force collaborators to install
git-lfs, so I set up the project in such a way that e.g. the file
filter unit tests check whether those files are available and, if not,
skip those tests (visibly).
This also makes sense because of the size restrictions for R package
submissions to CRAN, and as I'm the maintainer from CRAN's point of
view, I can always make sure I properly run all tests.

- With this setup, I do not experience the collaboration trouble/broken
forking issues Peter Stéphane describes in the link in Carl's mail. At
least not for the parts of the project that are stored in "normal" git.
I've not yet had anyone try to directly submit files that should go
into the lfs part of the repo.

- I tried to get git-lfs installed together with a private gitlab
instance (thinking we might want to use it for data-type projects),
but like Carl, I gave up. That was IIRC 3 years ago, so things may
have improved in the meantime.
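The skip-if-missing pattern from the unit-test point above could look
roughly like the sketch below. The real project is an R package (where
one would use testthat's skip mechanism); this is a Python/unittest
translation of the idea, and the directory and file names are made up:

```python
import os
import unittest

# Hypothetical layout: in the real project this would be wherever the
# git-lfs checkout puts the example files (all names here are invented).
DATA_DIR = os.environ.get("EXAMPLE_DATA_DIR", "tests/example-files")

def have_example_data(*names):
    """True if all named example files are present in DATA_DIR."""
    return all(os.path.exists(os.path.join(DATA_DIR, n)) for n in names)

class TestImportFilters(unittest.TestCase):
    def test_vendor_format_import(self):
        if not have_example_data("spectrum.fmt"):
            # A visible skip, not a silent pass: the test report shows
            # that this test did not run because the lfs content was
            # not fetched.
            self.skipTest("example file missing (git-lfs content not fetched)")
        # ... the actual import-filter assertions would go here ...
```

Collaborators without git-lfs then see the skipped tests listed in the
report, rather than spurious failures or a misleading all-green run.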

For other "code-type" projects (model/algorithm development), I tend to
take a two-layered approach. Data sets that are small enough to be
shipped as example and unit test data, say, in an R package are kept
with the code. In fact, many of them are toy data computed from code,
and I just store that code. The second layer consists of well-known
example data sets, and there I simply rely on those data sets staying
available. (I'm thinking of, e.g., the NASA AVIRIS data sets:
https://aviris.jpl.nasa.gov/data/free_data.html)
(Side note: I'm somewhat wary of papers proposing their own new
algorithm solely on their own data set, and of algorithm comparisons
based on one or a few data sets.)


**Data Project**

This is where I think things could be improved :-)

The majority of projects I work on are data analysis projects, i.e. we
have measurement data, do an analysis, draw conclusions, and write a
report or paper.

For these projects, we tend to take a "raw data and code are real"
approach, which also implies that the raw data are never changed (with
the only exception of renaming files - but the files I'm thinking of
store their original name, so even that can be reconstructed). So we
basically have storage and distribution needs, but not really
versioning needs. We sometimes produce pre-processed intermediate data,
but that again is defined by the code that produces it from the raw
data, and the results are considered temporary files. If I do manual
curation (mostly excluding bad runs with certain artifacts), I produce
code or data files that say which files were excluded and for what
reason. Most of this can be and is done in an automated fashion,
though.
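One lightweight way to keep that curation reproducible is a plain
exclusions table that the analysis code applies when it gathers the raw
files. A minimal sketch (file names, column names, and reasons are all
invented for illustration):

```python
import csv
import io

# Invented example of an exclusions file kept next to the raw data;
# in practice this would be a small .csv checked in with the code.
EXCLUSIONS_CSV = """\
file,reason
run_007.txt,detector artifact
run_013.txt,sample mislabelled
"""

def load_exclusions(fh):
    """Map excluded file name -> reason, from a two-column csv."""
    return {row["file"]: row["reason"] for row in csv.DictReader(fh)}

def keep(files, exclusions):
    """Return only the files that are not excluded."""
    return [f for f in files if f not in exclusions]

exclusions = load_exclusions(io.StringIO(EXCLUSIONS_CSV))
good = keep(["run_006.txt", "run_007.txt", "run_008.txt"], exclusions)
```

The point is that the exclusion itself is data (with a recorded
reason), so the curated analysis can be regenerated from raw data plus
code at any time.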

Producing versions that are to be kept (such as snapshots of the state
of the data for a paper) is sufficiently infrequent that we just zip
those data and put the version in the file name.
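As a sketch of that snapshot convention (the function and label names
are made up; a shell `zip -r` one-liner would do just as well):

```python
import datetime
import os
import zipfile

def snapshot(data_dir, label):
    """Zip data_dir into an archive whose name carries the version.

    'label' is a free-form version tag, e.g. 'paper-revision-1';
    the current date is appended so snapshots sort chronologically.
    """
    stamp = datetime.date.today().isoformat()
    base = os.path.basename(os.path.normpath(data_dir))
    archive = f"{base}_{label}_{stamp}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(data_dir):
            for fname in files:
                path = os.path.join(root, fname)
                # store paths relative to data_dir inside the archive
                zf.write(path, os.path.relpath(path, data_dir))
    return archive
```

Calling `snapshot("rawdata", "paper-revision-1")` would produce e.g.
`rawdata_paper-revision-1_2021-03-01.zip`, which is the whole
versioning scheme in one file name.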

Recently, I tend to use Nextcloud to share such data. We did use git
for a while, but with large amounts of data that becomes cumbersome,
and we found that few collaborators were willing to learn even just the
level of git that lets them clone and pull. Owncloud/Nextcloud is a
much lower barrier in that respect.

At the moment, I think what I'd like to see would be Nextcloud with
commits, ignores, and maybe a somewhat more distributed and less
central approach ...

Versioning binary data would be far more important for colleagues who
extensively use GUI software for their analyses: not all of the
relevant software keeps logs/recovery data (some does, though, as it is
to be used in fields like pharma where full audit trails are required).


**Data Projects II**

(Here I see huge possibilities for improvement)

OTOH, we also have some projects where it is clear that a large variety
of subsets of the data will be requested and analysed, and we've set up
databases for those purposes. Here again, I do dumps/backups, and on
the rare occasion that a version should be tagged, that can be done on
the backup/dump. These databases are set up in a way that easily allows
adding/inserting, but changing or deleting requires admin rights - and
the admin should make sure of a backup before doing any such "surgery"
on the database.
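A minimal sketch of that append-only idea, here via sqlite triggers in
Python (table and column names are invented; a real multi-user setup
would instead use the database server's permission system, e.g. by not
granting UPDATE/DELETE to regular users):

```python
import sqlite3

def open_append_only(path=":memory:"):
    """Open a toy database where rows can be added but not changed."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS measurements (run TEXT, value REAL)")
    # Block UPDATE and DELETE at the database level; an "admin" would
    # drop these triggers (after taking a dump/backup) to do surgery.
    con.execute("""
        CREATE TRIGGER IF NOT EXISTS no_update BEFORE UPDATE ON measurements
        BEGIN SELECT RAISE(ABORT, 'append-only: updates forbidden'); END
    """)
    con.execute("""
        CREATE TRIGGER IF NOT EXISTS no_delete BEFORE DELETE ON measurements
        BEGIN SELECT RAISE(ABORT, 'append-only: deletes forbidden'); END
    """)
    return con
```

With this, inserting new measurement rows works normally, while any
attempted UPDATE or DELETE fails loudly instead of silently rewriting
the record of what was measured.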
I may say that I'm originally from a wet-lab field (chemistry): I'm
trained to work under conditions where mistakes irretrievably mess
things up. Version control and being able to undo mistakes are good and
important, but if these techniques (luxuries?) are not available at
every point, then that's how it is for now.

I admit that I never bothered implementing full audit trails - and the
supervisors I had were already skeptical about whether it was
worthwhile to set up a database, and very much against "wastes of time"
such as (for code projects) unit testing and encapsulating code in
packages/libraries/their own namespaces...

I've met one research institute, though, that runs a full LIMS
(laboratory information management system). That, however, is more
suited to situations where the same types of analyses are repeatedly
done on new samples than to research questions where not only the
samples but also the analysis methods change from project to project.


But e.g. REDCap https://projectredcap.org/ produces databases with
audit trails. (I've never tried it, though.)


Best,

Claudia



-- 

Claudia Beleites Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone:  +49 (15 23) 1 83 74 18
e-mail: [email protected]
USt-ID: DE305606151


------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M278751611e28840c648e49a9