Hi all,

I'm also very interested in learning about solutions for this.
At the moment I distinguish two use cases:

- the focus of the project is coding (developing software/a package/a library), vs.
- the focus of the project is data analysis, with the sub-topic of projects where various "slices" of the data are important.

**Code project**

I have one project where I use git-lfs on GitHub. The project is about *code* (an R package) that however has some 100 MB of binary data attached to it (it was larger at some point, before I could get smaller but equally suitable example files for some formats). The binary data are example files in various file formats for the file import filters the package provides. Initially, we had them in git as well, but that horribly bloated the repo, so it got unusable after a few years. The files themselves, however, hardly need any versioning: I get them and store them as they are, and only very occasionally is one of those files replaced. The main point of the git-lfs storage is to make sure that all files are where they are supposed to be without too much manual hassle. At some point I was lucky to get a GitHub promo offer for free git-lfs (test) usage and gave it a try - which is the current state.

Experiences:

- (Due to the free promo I don't have bandwidth billing trouble.)
- Git is largely independent of git-lfs: you can still fork/clone the git-only part of the repo and work with that. For the project in question, the files stored in git-lfs are only needed for developing and unit testing the file import filters; everything else does not need git-lfs. I decided I don't want to force collaborators to install git-lfs, so I set up the project in such a way that e.g. the file filter unit tests check whether those files are available and, if not, skip those tests (visibly) - a short sketch of this follows below. This also makes sense because of the size restrictions for R package submissions to CRAN, and as I'm the maintainer in the view of CRAN, I can always make sure I properly run all tests.
- With this setup, I do not experience the collaboration trouble/broken forking issues Peter Stéphane describes in the link in Carl's mail - at least not for the parts of the project that are stored as "normal" git. I've not yet had anyone try to directly submit files that should go into the lfs part of the repo.
- I tried to get git-lfs installed together with a private GitLab instance (thinking we may want to use it for data-type projects), but like Carl, I gave up. That was IIRC 3 years ago, so things may have improved in the meantime.

For other "code-type" projects (model/algorithm development), I tend to take a two-layered approach. Data sets that are small enough to be shipped as example and unit test data, say, in an R package are kept with the code; in fact, many of them are toy data computed from code, and I just store that code. The second layer are well-known example data sets, and there I simply rely on those data sets staying available (I'm talking about e.g. the NASA AVIRIS data sets, https://aviris.jpl.nasa.gov/data/free_data.html).

(Side note: I'm somewhat wary of papers proposing their own new algorithm solely on their own data set, and of algorithm comparisons based on only one or a few data sets.)

**Data Project**

This is where I think things could be improved :-)

The majority of projects I work on are data analysis projects, i.e. we have measurement data, do an analysis, draw conclusions, and write a report or paper.
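(To make the test skipping under **Code project** concrete, here is roughly the pattern - a minimal sketch assuming testthat; the helper, file name, and import function are made up for illustration.)

```r
library(testthat)

## the example files live outside the package sources and are fetched via git-lfs;
## this helper and its path are invented for illustration
lfs_file <- function(...) {
  file.path("..", "..", "fileio", ...)
}

test_that("example format import works", {
  f <- lfs_file("example_cube.dat")

  ## visibly skip (rather than fail) when the git-lfs files are not checked out
  skip_if_not(file.exists(f), "git-lfs example files not available")

  spc <- read_example_format(f)  # placeholder for the import filter under test
  expect_true(nrow(spc) > 0)
})
```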
For these data analysis projects, we tend to take a "raw data and code are real" approach, which also implies that the raw data is never changed (with the only exception of renaming files - but the files I'm thinking of store their original name, so even that can be reconstructed). So we basically have storage and distribution needs, but not really versioning needs. We sometimes produce pre-processed intermediate data, but that again is defined by the code that produces it from the raw data, and the results are considered temporary files. If I do manual curation (mostly excluding bad runs with certain artifacts), I produce code or data files that say which files were excluded and for what reason (a short sketch of such an exclusion file follows at the end of this mail). Most of this can be and is done in an automated fashion, though. Producing versions that are to be kept (such as snapshots of the state of the data for a paper) is sufficiently infrequent that I just zip those data and put the version in the file name.

Recently, I tend to use Nextcloud to share such data. We did use git for a while, but with large amounts of data that becomes cumbersome, and we found that few collaborators were willing to learn even just the level of git that lets them clone and pull. Owncloud/Nextcloud is a much lower barrier in that respect. At the moment I think what I'd like to see would be Nextcloud with commits, ignores and maybe a somewhat more distributed and less central approach ...

Versioning binary data would be far more important for colleagues who extensively use GUI software for their analyses: not all of the relevant software keeps logs/recovery data (some does, though, as it is to be used in fields like pharma where full audit trails are required).

**Data Projects II**

(Here I see huge possibilities for improvement.)

OTOH, we also have some projects where it is clear that a large variety of subsets of the data will be requested and analysed, and we've set up databases for those purposes. Here again, I do dumps/backups, and on the rare occasion that a version should be tagged, that can be done to the backup/dump. Again, these databases are set up in a way that easily allows adding/inserting, but changing or deleting requires admin rights - and the admin should make sure of the backup before doing any such "surgery" to the database (a sketch of the corresponding permissions follows at the end of this mail).

I may say that I'm originally from a wet-lab field (chemistry): I'm trained to work under conditions where mistakes irretrievably mess things up. Version control and being able to undo mistakes is good and important, but if these techniques (luxuries?) are not available at every point, that's how it is right now. I admit that I never bothered to implement full audit trails - and the supervisors I had were already suspicious whether it is worthwhile to set up a database at all, and very much against "waste of time" such as (for code projects) unit testing and encapsulating code in packages/libraries/their own namespace...

I've met one research institute, though, that ran a full LIMS (laboratory information management system), which however is better suited for situations where the same types of analyses are repeatedly done for new samples than for research questions where not only the samples but also the analysis methods change from project to project. But e.g. REDCap https://projectredcap.org/ produces databases with audit trails. (Never tried it, though.)
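P.S. Two sketches to make the data-project bits concrete. First, the exclusion file from the manual curation under **Data Project** - just an invented example of how such a file can look and be applied (the file names, columns, and the use of readr/dplyr are my assumptions):

```r
library(readr)
library(dplyr)

## invented example of a small exclusion file kept next to the raw data:
##   file,        reason
##   run-017.spc, detector saturation
##   run-023.spc, sample mislabelled
exclusions <- read_csv("curation/excluded-runs.csv")

raw_files <- tibble(file = list.files("raw-data", pattern = "\\.spc$"))

## keep only runs that are not on the exclusion list, so the curation step
## stays reproducible from the unchanged raw data plus this one small file
kept <- anti_join(raw_files, exclusions, by = "file")
```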
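And second, the "adding is easy, changing or deleting needs admin" database setup under **Data Projects II** - a rough sketch assuming PostgreSQL via DBI/RPostgres; the role and table names are made up:

```r
library(DBI)

## admin connection (connection details omitted)
con <- dbConnect(RPostgres::Postgres(), dbname = "measurements")

## collaborators may read and append ...
dbExecute(con, "GRANT SELECT, INSERT ON measurement_runs TO analyst")

## ... but changing or deleting existing rows stays with the admin role
## (redundant if the role never had these rights, but it documents the intent)
dbExecute(con, "REVOKE UPDATE, DELETE ON measurement_runs FROM analyst")

dbDisconnect(con)
```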
Best,
Claudia

--
Claudia Beleites
Chemometric Consulting
Södeler Weg 19
61200 Wölfersheim
Germany

phone: +49 (15 23) 1 83 74 18
e-mail: [email protected]
USt-ID: DE305606151
