Tim (and everyone who has the same questions here), OSF is definitely something I'll check out.
However, I note that the privacy document explicitly spells out that, as a US-based repository, OSF does not meet the requirements of EU privacy legislation (and I've been working with sensitive/patient data, so privacy and related security aspects are an important consideration). This, together with the experience that some research labs prefer to keep their data in-house, makes me guess that a system that can be set up in-house* would have a much better chance of being approved by management over here, also because of legal considerations.

* or in a DMZ, giving the chance to expose the public parts of their projects; or running two instances, an internal one kept strictly in-house and an exposed one for the public parts.

As OSF states it is FOSS, this should be possible, but I did not immediately see instructions on "how to run it on your own server", nor technical requirements. Could you point me to such information, or is there even something like a "we run our own instance" user group?

Many thanks,
Claudia

On 21.07.2018 19:29, Tim Head via discuss wrote:
> Hello all,
>
> in the hopes of making it easier to use osf.io with large datasets,
> last summer we* had some time and funding to start building
> http://osfclient.readthedocs.io/en/latest/cli-usage.html, which is
> both a command-line program and a Python library for osf.io. The tool
> works well for gigabyte-sized files, and there is starting to be a
> small community of people who contribute fixes and new features when
> something they need is missing. It would be great to grow this
> further.
>
> Maybe this removes that one last hurdle that was stopping you from
> putting all your datasets on osf.io (when we asked about size limits
> they were confident no one would ever reach them ...
> and I still don't know anyone who has found it)
>
> T
>
> * "we" in this case is Titus Brown and me
>
> On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites
> <[email protected]> wrote:
>
> Hi all,
>
> I'm also very interested in learning solutions for this.
>
> At the moment I distinguish two use cases:
>
> - focus of the project is coding (developing a software/package/library) vs.
>
> - focus of the project is data analysis, with the sub-topic of
> projects where various "slices" of the data are important.
>
> **Code project**
>
> I have one project where I use git-lfs on github (I got a promo offer
> for free use). The project is about *code* (an R package) that,
> however, has some 100 MB of binary data attached to it (it was larger
> at some point, before I could get smaller but equally suitable example
> files for some formats). The binary data are example files, in various
> file formats, for the file import filters the package provides.
> Initially we had them in git as well, but that horribly bloated the
> repo, so it became unusable after a few years. The files themselves,
> however, hardly need any versioning: I get them and store them as they
> are, and only very occasionally is one of those files replaced. The
> main point of the git-lfs storage is to make sure that all files are
> where they are supposed to be without too much manual hassle. At some
> point I was lucky to get a github promo offer for free git-lfs (test)
> usage and gave it a try - which is the current state.
>
> Experiences:
>
> - (due to the free promo I don't have bandwidth billing trouble)
>
> - Git is largely independent of git-lfs: you can still fork/clone the
> git-only part of the repo and work with that. For the project in
> question, the files stored in git-lfs are only needed for developing
> and unit testing of the file import filters; everything else does not
> need git-lfs.
> I decided I don't want to force collaborators to install git-lfs, so I
> set up the project in such a way that e.g. the file filter unit tests
> check whether those files are available and, if not, skip those tests
> (visibly).
> This also makes sense because of the size restrictions for R package
> submissions to CRAN, and as I'm the maintainer in the view of CRAN, I
> can always make sure I properly run all tests.
>
> - With this setup, I do not experience the collaboration trouble /
> broken-forking issues Peter Stéphane describes in the link in Carl's
> mail. At least not for the parts of the project that are stored as
> "normal" git. I've not yet had anyone try to directly submit files
> that should go into the lfs part of the repo.
>
> - I tried to get git-lfs installed together with a private gitlab
> instance (thinking we might want to use it for data-type projects),
> but like Carl, I gave up. That was IIRC 3 years ago, so things may
> have improved in the meantime.
>
> For other "code-type" projects (model/algorithm development), I tend
> to take a two-layered approach. Data sets that are small enough to be
> shipped as example and unit-test data in, say, an R package are kept
> with the code. In fact, many of them are toy data computed from code,
> and I just store that code. The second layer are well-known example
> data sets, and there I simply rely on those data sets staying
> available. (I'm talking about e.g. the NASA AVIRIS data sets,
> https://aviris.jpl.nasa.gov/data/free_data.html)
> (Side note: I'm somewhat wary of papers proposing their own new
> algorithm solely on their own data set, and of algorithm comparisons
> based on one or a few data sets.)
>
> **Data Project**
>
> This is where I think things could be improved :-)
>
> The majority of projects I work on are data analysis projects. I.e. we
> have measurement data, do an analysis, draw conclusions, and write a
> report or paper.
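The skip-if-missing pattern described above can be sketched as follows. This is a hypothetical illustration in Python (the actual project is an R package, and the directory layout and file name here are made up):

```python
import os

# Hypothetical location of the example files fetched via git-lfs;
# collaborators who did not install git-lfs simply won't have them here.
LFS_DATA_DIR = os.path.join("tests", "data", "lfs")

def have_lfs_file(name):
    """True if the git-lfs example file is actually present on disk."""
    return os.path.exists(os.path.join(LFS_DATA_DIR, name))

def test_import_filter():
    """Unit test for a file import filter; skips visibly if data is missing."""
    if not have_lfs_file("example.spc"):
        print("SKIP: git-lfs example data not available")
        return
    # ... the real test would run the import filter on the example file ...
```

The point is that the test suite still passes without the large files, but the skip is printed so nobody mistakes a skipped test for a passed one.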
> For these projects, we tend to take a "raw data and code are real"
> approach, which also implies that the raw data is never changed (with
> the only exception of renaming files - but the files I'm thinking of
> store their original name, so even that can be reconstructed). So we
> basically have storage and distribution needs, but not really
> versioning needs. We sometimes produce pre-processed intermediate
> data, but that again is defined by the code that produces it from the
> raw data, and the results are considered temporary files. If I do
> manual curation (mostly excluding bad runs with certain artifacts), I
> produce code or data files that say which files were excluded and for
> what reason. Most of this can be, and is, done in an automated
> fashion, though.
>
> Producing versions that are to be kept (such as snapshots of the state
> of the data for a paper) is sufficiently infrequent that we just zip
> those data and put the version in the file name.
>
> Recently I tend to use nextcloud to share such data. We did use git
> for a while, but with large amounts of data that does become
> cumbersome, and we found that few collaborators were willing to learn
> even just the level of git that lets them clone and pull.
> Owncloud/Nextcloud is a much lower barrier in that respect.
>
> At the moment, I think what I'd like to see would be nextcloud with
> commits, ignores, and maybe a somewhat more distributed and less
> central approach ...
>
> Versioning binary data would be far more important for colleagues who
> extensively use GUI software for their analyses: not all of the
> relevant software keeps logs/recovery data (some do, though, as they
> are to be used in fields like pharma where full audit trails are
> required).
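The zip-with-version-in-the-name snapshots mentioned above could look roughly like this (a minimal Python sketch; the directory name and version label are invented):

```python
import os
import zipfile

def snapshot(raw_dir, version, out_dir="."):
    """Zip the raw data directory into an archive whose name carries the version."""
    archive = os.path.join(out_dir, "rawdata_%s.zip" % version)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(raw_dir):
            for name in files:
                path = os.path.join(root, name)
                # store paths relative to the raw data directory
                zf.write(path, os.path.relpath(path, raw_dir))
    return archive

# e.g. snapshot("rawdata", "paper-2018-07")
```

Because the raw data never changes, the version label plus the archive is enough; no versioning system is needed for this step.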
> **Data Projects II**
>
> (Here I see huge possibilities for improvement)
>
> OTOH, we also have some projects where it is clear that a large
> variety of subsets of the data will be requested and analysed, and
> we've set up databases for those purposes. Here again I do
> dumps/backups, and on the rare occasion that a version should be
> tagged, that can be done on the backup/dump. Again, these databases
> are set up in a way that easily allows adding/inserting, while
> changing or deleting requires admin rights - and the admin should make
> sure of the backup before doing any such "surgery" on the database.
> I may say that I'm originally from a wet-lab field (chemistry): I'm
> trained to work under conditions where mistakes irretrievably mess
> things up. Version control and being able to undo mistakes are good
> and important, but if these techniques (luxuries?) are not available
> at every point, that's how it is right now.
>
> I admit that I never bothered with implementing full audit trails -
> and the supervisors I had were already suspicious of whether it is
> worthwhile to set up a database at all, and very much against "wastes
> of time" such as (for code projects) unit testing and encapsulating
> code in packages/libraries/their own namespace...
>
> I've met one research institute, though, that runs a full LIMS
> (laboratory information management system), which, however, is more
> suited to situations where the same types of analyses are repeatedly
> done for new samples than to research questions where not only the
> samples but also the analysis methods change from project to project.
>
> But e.g. RedCap https://projectredcap.org/ produces databases with
> audit trails. (I never tried it, though.)
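The "easy to insert, admin needed to change or delete" convention can be illustrated in miniature with SQLite triggers. This is purely a sketch - the thread doesn't say which database systems are in use, and on a client/server database one would typically just grant normal users only INSERT and SELECT instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, sample TEXT, value REAL)"
)

# Triggers that make the table append-only for everyone on this connection;
# an admin would drop the triggers before doing any "surgery".
conn.execute("""
    CREATE TRIGGER no_update BEFORE UPDATE ON measurements
    BEGIN SELECT RAISE(ABORT, 'measurements is append-only'); END
""")
conn.execute("""
    CREATE TRIGGER no_delete BEFORE DELETE ON measurements
    BEGIN SELECT RAISE(ABORT, 'measurements is append-only'); END
""")

conn.execute("INSERT INTO measurements (sample, value) VALUES ('A1', 0.42)")  # fine

try:
    conn.execute("DELETE FROM measurements")
except sqlite3.DatabaseError as err:
    # the DELETE is rejected by the trigger
    print("blocked:", err)
```

The same idea scales up: inserts stay cheap and unceremonious, while destructive operations require a deliberate, privileged step.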
> Best,
>
> Claudia
>
> --
>
> Claudia Beleites Chemometric Consulting
> Södeler Weg 19
> 61200 Wölfersheim
> Germany
>
> phone: +49 (15 23) 1 83 74 18
> e-mail: [email protected]
> USt-ID: DE305606151
>
> ------------------------------------------
> The Carpentries: discuss
> Permalink:
> https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M278751611e28840c648e49a9
> Delivery options:
> https://carpentries.topicbox.com/groups/discuss/subscription

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc4ed5415923925413699397a
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
