Hello all — in the hope of making it easier to use osf.io with large datasets, last summer we* had some time and funding to start building http://osfclient.readthedocs.io/en/latest/cli-usage.html, which is both a command-line program and a Python library for osf.io. The tool works well for gigabyte-sized files, and a small community of people is starting to form who contribute fixes and new features when something they need is missing. It would be great to grow this further.
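For anyone who wants a quick taste, the command-line side looks roughly like this (the project id `abc12` and file paths are placeholders; see the linked docs for the authoritative usage):

```console
pip install osfclient

# for private projects, set OSF_USERNAME / OSF_PASSWORD (or run `osf init`)
osf -p abc12 clone                              # download a whole project
osf -p abc12 list                               # list the remote files
osf -p abc12 upload data/run1.csv data/run1.csv
osf -p abc12 fetch data/run1.csv run1-copy.csv
```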
Maybe this removes that one last hurdle that was stopping you from putting all your datasets on osf.io (when we asked about size limits, they were confident no one would ever reach them ... and I still don't know anyone who has).

T

* "we" in this case is Titus Brown and me

On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites <[email protected]> wrote:

> Hi all,
>
> I'm also very interested in learning about solutions for this.
>
> At the moment I distinguish two use cases:
>
> - the focus of the project is coding (developing a software package/library), vs.
> - the focus of the project is data analysis, with the sub-topic of projects where various "slices" of the data are important.
>
> **Code project**
>
> I have one project where I use git-lfs on GitHub: at some point I was lucky to get a GitHub promo offer for free git-lfs (test) usage and gave it a try, which is the current state. The project is about *code* (an R package) that, however, has some 100 MB of binary data attached to it (it was larger at some point, before I could get smaller but equally suitable example files for some formats). The binary data are example files in various file formats for the file import filters the package provides. Initially we had them in git as well, but that horribly bloated the repo, so it became unusable after a few years. The files themselves, however, hardly need any versioning: I get them and store them as they are, and only very occasionally is one of those files replaced. The main point of the git-lfs storage is to make sure that all files are where they are supposed to be without too much manual hassle.
>
> Experiences:
>
> - (Due to the free promo I don't have bandwidth-billing trouble.)
> - Git is largely independent of git-lfs: you can still fork/clone the git-only part of the repo and work with that.
>   For the project in question, the files stored in git-lfs are only needed for developing and unit testing the file import filters; everything else does not need git-lfs. I decided I don't want to force collaborators to install git-lfs, so I set up the project in such a way that, e.g., the file-filter unit tests check whether those files are available and, if not, skip those tests (visibly). This also makes sense because of the size restrictions for R package submissions to CRAN, and as I'm the maintainer in the view of CRAN, I can always make sure I properly run all tests.
> - With this setup, I do not experience the collaboration trouble/broken forking issues Peter Stéphane describes in the link in Carl's mail, at least not for the parts of the project that are stored as "normal" git. I've not yet had anyone try to directly submit files that should go into the lfs part of the repo.
> - I tried to get git-lfs installed together with a private GitLab instance (thinking we might want to use it for data-type projects), but like Carl, I gave up. That was IIRC 3 years ago, so things may have improved meanwhile.
>
> For other "code-type" projects (model/algorithm development), I tend to take a two-layered approach. Data sets that are small enough to be shipped as example and unit-test data in, say, an R package are kept with the code; in fact, many of them are toy data computed from code, and I just store that code. The second layer is well-known example data sets, and there I simply rely on those data sets staying available (I'm thinking of, e.g., the NASA AVIRIS data sets, https://aviris.jpl.nasa.gov/data/free_data.html). (Side note: I'm somewhat wary of papers proposing their own new algorithm solely on their own data set, and of algorithm comparisons based on one or a few data sets.)
>
> **Data Project**
>
> This is where I think things could be improved :-)
>
> The majority of projects I work on are data analysis projects, i.e.
> we have measurement data, do an analysis, draw conclusions, and write a report or paper.
>
> For these projects, we tend to take a "raw data and code are real" approach, which also implies that the raw data are never changed (with the only exception of renaming files, but the files I'm thinking of store their original name, so even that can be reconstructed). So we basically have storage and distribution needs, but not really versioning needs. We sometimes produce pre-processed intermediate data, but that again is defined by the code that produces it from the raw data, and the results are considered temporary files. If I do manual curation (mostly excluding bad runs with certain artifacts), I produce code or data files that say which files were excluded and for what reason. Most of this can be, and is, done in an automated fashion, though.
>
> Producing versions of this that are to be kept (such as making snapshots of the state of the data for a paper) is sufficiently infrequent to just zip those data and put the version in the file name.
>
> Recently, I tend to use Nextcloud to share such data. We did use git for a while, but with large amounts of data that becomes cumbersome, and we found that few collaborators were willing to learn even just the level of git that lets them clone and pull. Owncloud/Nextcloud is a much lower barrier in that respect.
>
> At the moment, I think what I'd like to see would be Nextcloud with commits, ignores, and maybe a somewhat more distributed and less central approach ...
>
> Versioning binary data would be far more important for colleagues who extensively use GUI software for their analyses: not all of the relevant software keeps logs/recovery data (some do, though, as they are to be used in fields like pharma where full audit trails are required).
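The zip-with-the-version-in-the-file-name snapshot scheme described above can be sketched in a few lines of Python (the directory layout and names are made up for illustration):

```python
import datetime
import pathlib
import shutil

def snapshot(data_dir, out_dir="snapshots", label=None):
    """Zip a raw-data directory, putting a version label in the file name.

    `label` could be e.g. "paper-revision-1"; it defaults to today's date.
    Returns the path of the created .zip file.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    label = label or datetime.date.today().isoformat()
    base = out / f"rawdata-{label}"
    # shutil.make_archive appends ".zip" and returns the archive's path
    return shutil.make_archive(str(base), "zip", root_dir=data_dir)
```

Because the raw data never change, re-running this with a new label is cheap, and the archives double as the "tagged versions" of the data.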
> **Data Projects II**
>
> (Here I see huge possibilities for improvement.)
>
> OTOH, we also have some projects where it is clear that a large variety of subsets of the data will be requested and analysed, and we've set up databases for those purposes. Here again, I do dumps/backups, and on the rare occasion that a version should be tagged, that can be done to the backup/dump. Again, these databases are set up in a way that easily allows adding/inserting, but changing or deleting requires admin rights, and the admin should make sure of the backup before doing any such "surgery" on the database.
>
> I may say that I'm originally from a wet-lab field (chemistry): I'm trained to work under conditions where mistakes irretrievably mess things up. Version control and being able to undo mistakes are good and important, but if these techniques (luxuries?) are not available at every point, that's as it is right now.
>
> I admit that I never bothered implementing full audit trails, and the supervisors I had were already suspicious of whether it is worthwhile to set up a database, and very much against "wastes of time" such as (for code projects) unit testing and encapsulating code in packages/libraries/their own namespace ...
>
> I've met one research institute, though, that runs a full LIMS (laboratory information management system), which, however, is more suited to situations where the same types of analyses are repeatedly done on new samples, rather than to research questions where not only the samples but also the analysis methods change from project to project.
>
> But, e.g., RedCap (https://projectredcap.org/) produces databases with audit trails. (I've never tried it, though.)
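The "adding is easy, changing or deleting needs admin rights" policy described above can be emulated in a self-contained way with SQLite triggers. This is only a sketch of the idea: the real setup presumably relies on database user permissions, and the table name here is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, fname TEXT)")

# Reject any UPDATE or DELETE on the table; INSERT stays unrestricted.
for action in ("UPDATE", "DELETE"):
    conn.execute(
        f"""CREATE TRIGGER runs_no_{action.lower()} BEFORE {action} ON runs
            BEGIN SELECT RAISE(ABORT, 'runs is append-only'); END"""
    )

conn.execute("INSERT INTO runs (fname) VALUES ('scan001.spc')")  # allowed
try:
    conn.execute("DELETE FROM runs")  # rejected by the trigger
except sqlite3.DatabaseError as err:
    print(err)
```

In a multi-user server database the same effect is usually achieved by granting a role only SELECT and INSERT, which keeps "surgery" on the data an explicit admin action.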
> Best,
> Claudia
>
> --
> Claudia Beleites Chemometric Consulting
> Södeler Weg 19
> 61200 Wölfersheim
> Germany
>
> phone: +49 (15 23) 1 83 74 18
> e-mail: [email protected]
> USt-ID: DE305606151
>
> ------------------------------------------
> The Carpentries: discuss
> Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M278751611e28840c648e49a9
> Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
> ------------------------------------------

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M4bc60498dcb8d4d88fce6cb6
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
------------------------------------------
