Tim (and everyone who has the same questions here), OSF is definitely something I'll check out.
However, I note that the privacy document explicitly spells out that, as a US-based repository, OSF does not meet the requirements of EU privacy legislation (and I've been working with sensitive/patient data, so privacy and related security aspects are an important consideration). This, together with the experience that some research labs prefer to keep their data in-house, makes me guess that a system that can be set up in-house* would have a much better chance of being approved by management over here, also because of legal considerations.

* or in a DMZ, giving the chance to expose the public parts of their projects; or running two instances, an internal one kept strictly in-house and an exposed one for the public parts.

As OSF states it is FOSS, this should be possible, but I did not immediately see instructions on "how to run it on your own server", nor technical requirements. Could you point me to such information, or is there even something like a "we run our own instance" user group?

Many thanks,
Claudia

On 21.07.2018 19:29, Tim Head via discuss wrote:
> Hello all,
>
> in the hopes of making it easier to use osf.io with large datasets,
> last summer we* had some time and funding to start building
> http://osfclient.readthedocs.io/en/latest/cli-usage.html, which is
> both a command-line program and a Python library for osf.io. The tool
> works well for gigabyte-sized files, and there is starting to be a
> small community of people who contribute fixes and new features when
> something they need is missing. It would be great to grow this
> further.
>
> Maybe this removes that one last hurdle that was stopping you from
> putting all your datasets on osf.io (when we asked about size limits
> they were confident no one would ever reach them ...
> and I still don't know anyone who has found it)
>
> T
>
> * "we" in this case is Titus Brown and me
>
> On Sat, Jul 21, 2018 at 6:29 PM Claudia Beleites
> <[email protected]> wrote:
>
> Hi all,
>
> I'm also very interested in learning solutions for this.
>
> At the moment I distinguish two use cases:
>
> - focus of the project is coding (developing a software/package/library) vs.
>
> - focus of the project is data analysis, with the sub-topic of
> projects where various "slices" of the data are important.
>
> **Code project**
>
> I have one project where I use git-lfs on github (I got a promo offer
> for free use). The project is about *code* (an R package) that,
> however, has some 100 MB of binary data attached to it (it was larger
> at some point, before I could get smaller but equally suitable example
> files for some formats). The binary data are example files, in various
> file formats, for the file import filters the package provides.
> Initially we had them in git as well, but that horribly bloated the
> repo, so it became unusable after a few years. The files themselves,
> however, hardly need any versioning: I get them and store them as they
> are, and only very occasionally is one of those files replaced. The
> main point of the git-lfs storage is to make sure that all files are
> where they are supposed to be without too much manual hassle. At some
> point I was lucky to get a github promo offer for free git-lfs (test)
> usage and gave it a try - which is the current state.
>
> Experiences:
>
> - (due to the free promo I don't have bandwidth billing trouble)
>
> - Git is largely independent of git-lfs: you can still fork/clone the
> git-only part of the repo and work with that. For the project in
> question, the files stored in git-lfs are only needed for developing
> and unit testing of the file import filters; everything else does not
> need git-lfs.
> I decided I don't want to force collaborators to install git-lfs, so I
> set up the project in such a way that e.g. the file filter unit tests
> check whether those files are available and, if not, skip those tests
> (visibly).
> This also makes sense because of the size restrictions for R package
> submissions to CRAN, and as I'm the maintainer in the view of CRAN, I
> can always make sure I properly run all tests.
>
> - With this setup, I do not experience the collaboration trouble /
> broken-forking issues Peter Stéphane describes in the link in Carl's
> mail. At least not for the parts of the project that are stored as
> "normal" git. I've not yet had anyone try to directly submit files
> that should go into the lfs part of the repo.
>
> - I tried to get git-lfs installed together with a private gitlab
> instance (thinking we might want to use it for data-type projects),
> but like Carl, I gave up. That was IIRC 3 years ago, so things may
> have improved in the meantime.
>
> For other "code-type" projects (model/algorithm development), I tend
> to take a two-layered approach. Data sets that are small enough to be
> shipped as example and unit-test data in, say, an R package are kept
> with the code. In fact, many of them are toy data computed from code,
> and I just store that code. The second layer are well-known example
> data sets, and there I simply rely on those data sets staying
> available. (I'm talking about e.g. the NASA AVIRIS data sets,
> https://aviris.jpl.nasa.gov/data/free_data.html)
> (Side note: I'm somewhat wary of papers proposing their own new
> algorithm solely on their own data set, and of algorithm comparisons
> based on one or a few data sets.)
>
> **Data Project**
>
> This is where I think things could be improved :-)
>
> The majority of projects I work on are data analysis projects. I.e. we
> have measurement data, do an analysis, draw conclusions, and write a
> report or paper.
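The skip-if-missing pattern described above can be sketched as follows. This is a hypothetical illustration in Python (the actual project is an R package, and the directory layout and file name here are made up):

```python
import os

# Hypothetical location of the example files fetched via git-lfs;
# collaborators who did not install git-lfs simply won't have them here.
LFS_DATA_DIR = os.path.join("tests", "data", "lfs")

def have_lfs_file(name):
    """True if the git-lfs example file is actually present on disk."""
    return os.path.exists(os.path.join(LFS_DATA_DIR, name))

def test_import_filter():
    """Unit test for a file import filter; skips visibly if data is missing."""
    if not have_lfs_file("example.spc"):
        print("SKIP: git-lfs example data not available")
        return
    # ... the real test would run the import filter on the example file ...
```

The point is that the test suite still passes without the large files, but the skip is printed so nobody mistakes a skipped test for a passed one.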
> For these projects, we tend to take a "raw data and code are real"
> approach, which also implies that the raw data is never changed (with
> the only exception of renaming files - but the files I'm thinking of
> store their original name, so even that can be reconstructed). So we
> basically have storage and distribution needs, but not really
> versioning needs. We sometimes produce pre-processed intermediate
> data, but that again is defined by the code that produces it from the
> raw data, and the results are considered temporary files. If I do
> manual curation (mostly excluding bad runs with certain artifacts), I
> produce code or data files that say which files were excluded and for
> what reason. Most of this can be, and is, done in an automated
> fashion, though.
>
> Producing versions that are to be kept (such as snapshots of the state
> of the data for a paper) is sufficiently infrequent that we just zip
> those data and put the version in the file name.
>
> Recently I tend to use nextcloud to share such data. We did use git
> for a while, but with large amounts of data that does become
> cumbersome, and we found that few collaborators were willing to learn
> even just the level of git that lets them clone and pull.
> Owncloud/Nextcloud is a much lower barrier in that respect.
>
> At the moment, I think what I'd like to see would be nextcloud with
> commits, ignores, and maybe a somewhat more distributed and less
> central approach ...
>
> Versioning binary data would be far more important for colleagues who
> extensively use GUI software for their analyses: not all of the
> relevant software keeps logs/recovery data (some do, though, as they
> are to be used in fields like pharma where full audit trails are
> required).
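The zip-with-version-in-the-name snapshots mentioned above could look roughly like this (a minimal Python sketch; the directory name and version label are invented):

```python
import os
import zipfile

def snapshot(raw_dir, version, out_dir="."):
    """Zip the raw data directory into an archive whose name carries the version."""
    archive = os.path.join(out_dir, "rawdata_%s.zip" % version)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(raw_dir):
            for name in files:
                path = os.path.join(root, name)
                # store paths relative to the raw data directory
                zf.write(path, os.path.relpath(path, raw_dir))
    return archive

# e.g. snapshot("rawdata", "paper-2018-07")
```

Because the raw data never changes, the version label plus the archive is enough; no versioning system is needed for this step.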
> **Data Projects II**
>
> (Here I see huge possibilities for improvement)
>
> OTOH, we also have some projects where it is clear that a large
> variety of subsets of the data will be requested and analysed, and
> we've set up databases for those purposes. Here again I do
> dumps/backups, and on the rare occasion that a version should be
> tagged, that can be done on the backup/dump. Again, these databases
> are set up in a way that easily allows adding/inserting, while
> changing or deleting requires admin rights - and the admin should make
> sure of the backup before doing any such "surgery" on the database.
> I may say that I'm originally from a wet-lab field (chemistry): I'm
> trained to work under conditions where mistakes irretrievably mess
> things up. Version control and being able to undo mistakes are good
> and important, but if these techniques (luxuries?) are not available
> at every point, that's how it is right now.
>
> I admit that I never bothered with implementing full audit trails -
> and the supervisors I had were already suspicious of whether it is
> worthwhile to set up a database at all, and very much against "wastes
> of time" such as (for code projects) unit testing and encapsulating
> code in packages/libraries/their own namespace...
>
> I've met one research institute, though, that runs a full LIMS
> (laboratory information management system), which, however, is more
> suited to situations where the same types of analyses are repeatedly
> done for new samples than to research questions where not only the
> samples but also the analysis methods change from project to project.
>
> But e.g. RedCap https://projectredcap.org/ produces databases with
> audit trails. (I never tried it, though.)
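The "easy to insert, admin needed to change or delete" convention can be illustrated in miniature with SQLite triggers. This is purely a sketch - the thread doesn't say which database systems are in use, and on a client/server database one would typically just grant normal users only INSERT and SELECT instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, sample TEXT, value REAL)"
)

# Triggers that make the table append-only for everyone on this connection;
# an admin would drop the triggers before doing any "surgery".
conn.execute("""
    CREATE TRIGGER no_update BEFORE UPDATE ON measurements
    BEGIN SELECT RAISE(ABORT, 'measurements is append-only'); END
""")
conn.execute("""
    CREATE TRIGGER no_delete BEFORE DELETE ON measurements
    BEGIN SELECT RAISE(ABORT, 'measurements is append-only'); END
""")

conn.execute("INSERT INTO measurements (sample, value) VALUES ('A1', 0.42)")  # fine

try:
    conn.execute("DELETE FROM measurements")
except sqlite3.DatabaseError as err:
    # the DELETE is rejected by the trigger
    print("blocked:", err)
```

The same idea scales up: inserts stay cheap and unceremonious, while destructive operations require a deliberate, privileged step.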
> Best,
>
> Claudia
>
> --
>
> Claudia Beleites Chemometric Consulting
> Södeler Weg 19
> 61200 Wölfersheim
> Germany
>
> phone: +49 (15 23) 1 83 74 18
> e-mail: [email protected]
> USt-ID: DE305606151
>
> ------------------------------------------
> The Carpentries: discuss
> Permalink:
> https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-M278751611e28840c648e49a9
> Delivery options:
> https://carpentries.topicbox.com/groups/discuss/subscription

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Mc4ed5415923925413699397a
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
