There is a lot of information here. Thanks to everyone offering insights! Here 
is some more context and detail for those who asked about my motivation for 
posting. 

As a student in an academic science lab that uses computers and code to do 
science, I am interested in learning and adopting the tools and practices 
software developers use to make *good* code: code that's easy to share, easy 
for other lab members and collaborators to read and pick up, and resilient to 
my screw-ups. In other words, I'd like to do things *right*, even though I 
don't know of anyone else on campus who does. I am also very motivated by the 
open science movement and want to adopt the tools necessary to be a part of it. 

I learned git (to a point), so that's cool. Now I'm trying to prod my lab mates 
and advisor to pick it up too. I also started thinking, "Well, what about the 
data? I could just gitignore it all, but sometimes it changes, branches, and 
needs to be reset too. And it'd be great if I didn't have to track all that by 
file names." In my current case, I'm working with large (>100 MB) image 
stacks. Versioning in this sense would ideally look something like recording a 
macro of the operations performed (basically diffs) between one version and 
the next. That's probably technically impossible, actually... Other data 
includes analysis and simulation data (.csv, .mat, etc.). This was when I 
posted this question.
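Even if true binary diffs are out of reach, the "recorded macro" idea could be approximated with a small provenance log kept under version control next to the gitignored data. This is just a hypothetical sketch of that idea (the function names and log format are my own invention, not an existing tool): each processing step appends one record naming the operation, its parameters, and content hashes of the input and output files, so the lightweight log can be committed while the big binaries stay out of git.

```python
import hashlib
import json
import time
from pathlib import Path

def file_hash(path):
    """Short content hash: identifies a data version without storing it in git."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def log_operation(logfile, operation, params, inputs, outputs):
    """Append one provenance record: what was done, to what, producing what."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "operation": operation,
        "params": params,
        "inputs": {str(p): file_hash(p) for p in inputs},
        "outputs": {str(p): file_hash(p) for p in outputs},
    }
    # One JSON object per line, so git diffs of the log stay readable.
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Replaying the log (or just reading it) would then tell you how any derived stack was produced from the raw one, which is most of what versioning the binaries themselves would buy.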

Currently, I am transitioning from having all data organized alongside 
everything else in a file system to integrating it into databases. I am new 
to the database universe, so forgive me for any misunderstandings here. 
I'm averse to SQL because I am certain that a single table for my heterogeneous 
data would be full of blanks, and I don't like the idea of complicated joins. 
I believe all data should be dynamic; by that I mean I have a vague notion 
that any new (or very old) data should be able to be integrated into a data 
model to further inform the analysis. MongoDB strikes me as a useful tool for 
just about all scenarios in this respect.
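To illustrate the sparse-table worry with a toy example (plain Python dicts standing in for MongoDB documents; the field names are made up for illustration): imaging runs and simulation runs share almost no columns, so a single SQL table would be mostly NULLs, while a document store just stores each record with whatever fields it has and queries by example.

```python
# Heterogeneous experiment records: each carries only its relevant fields.
# In one flat SQL table these would force many blank/NULL columns;
# in a document store each record is stored as-is.
experiments = [
    {"id": 1, "type": "imaging", "stack_file": "cells_01.tif", "channels": 3},
    {"id": 2, "type": "simulation", "solver": "ode45", "timestep": 0.01},
    {"id": 3, "type": "imaging", "stack_file": "cells_02.tif", "channels": 2},
]

def find(records, **criteria):
    """Minimal query-by-example, in the spirit of a document store's find()."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

imaging_runs = find(experiments, type="imaging")   # matches records 1 and 3
```

That said, the relational answer to blanks is usually several narrow tables rather than one wide one, so this is a trade-off (flexibility vs. enforced structure) rather than a strict win for either side.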

The data repositories suggested here are certainly useful (particularly OSF), 
but that brings up another issue I've been thinking about, which is 
discoverability. As an exemplar of the kind of solution to this problem I'm 
interested in, take a startup company I recently learned about called BenchSci 
<https://www.benchsci.com/>. Though they still have errors in the reported 
data, they are trying to solve a big problem in data discoverability regarding 
the use of antibodies in research. They're making a one-stop-shop where you can 
see vendor data and publication data for antibodies and targets, seriously 
reducing the leg work needed to hunt for all this information manually, and 
making it less likely that a good option will go undiscovered. Back to the more 
general data question: with so many repository options and so many formats, 
they all need to be tied together somehow. There should also be a way to 
incorporate 'legacy' data, i.e. data that's currently available only behind a 
paywall as a crappy JPG in supplemental figure 17, even though the high-res raw 
data might still exist on a hard drive somewhere and might be useful for some 
analysis not done in the original paper.

Obviously I'm starting to get a bit ahead of myself. I have a hard time not 
getting carried away.
------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma044d0880bb7896449f24aed