There is a lot of information here. Thanks to everyone offering insights! Here 
is some more context and detail for those who asked about my motivation for 
posting. 

As a student in an academic science lab that uses computers and code to do 
science, I am interested in learning and adopting the tools and practices 
software developers use to make *good* code: code that's easy to share, easy 
for other lab members and collaborators to read and pick up, and resilient to 
my screw-ups. In other words, I'd like to do things *right*, even though I 
don't know of anyone else on campus who does. I am also very motivated by the 
open science movement and want to adopt the tools necessary to be a part of it. 

I learned git (to a point), so that's cool. Now I'm trying to prod my lab mates 
and advisor to pick it up too. I also started thinking, "Well, what about the 
data? I could just gitignore it all, but sometimes it changes, branches, and 
needs to be reset too. And it'd be great if I didn't have to track all that by 
file names." In my current case, I'm working with large (>100 MB) image 
stacks. Versioning in this sense would ideally look something like recording a 
macro of the operations performed (basically diffs) between one version and 
the next. That's probably technically impossible, actually... Other data 
includes analysis and simulation data (.csv, .mat, etc.). This was when I 
posted this question.
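Even if true binary diffs are out of reach, the "recorded macro" idea could be approximated with a small provenance log kept under version control next to the gitignored data. This is just a hypothetical sketch of that idea (the function names and log format are my own invention, not an existing tool): each processing step appends one record naming the operation, its parameters, and content hashes of the input and output files, so the lightweight log can be committed while the big binaries stay out of git.

```python
import hashlib
import json
import time
from pathlib import Path

def file_hash(path):
    """Short content hash: identifies a data version without storing it in git."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def log_operation(logfile, operation, params, inputs, outputs):
    """Append one provenance record: what was done, to what, producing what."""
    record = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "operation": operation,
        "params": params,
        "inputs": {str(p): file_hash(p) for p in inputs},
        "outputs": {str(p): file_hash(p) for p in outputs},
    }
    # One JSON object per line, so git diffs of the log stay readable.
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Replaying the log (or just reading it) would then tell you how any derived stack was produced from the raw one, which is most of what versioning the binaries themselves would buy.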

Currently, I am transitioning from having all data organized alongside 
everything else in a file system to integrating it into databases. I am new 
to the database universe, so forgive me for any misunderstandings here. 
I'm averse to SQL because I am certain that a single table for my heterogeneous 
data would be full of blanks, and I don't like the idea of complicated joins. 
I believe all data should be dynamic; by that I mean I have a vague notion 
that any new (or very old) data should be able to be integrated into a data 
model to further inform the analysis. MongoDB strikes me as a useful tool for 
just about all scenarios in this respect.
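To illustrate the sparse-table worry with a toy example (plain Python dicts standing in for MongoDB documents; the field names are made up for illustration): imaging runs and simulation runs share almost no columns, so a single SQL table would be mostly NULLs, while a document store just stores each record with whatever fields it has and queries by example.

```python
# Heterogeneous experiment records: each carries only its relevant fields.
# In one flat SQL table these would force many blank/NULL columns;
# in a document store each record is stored as-is.
experiments = [
    {"id": 1, "type": "imaging", "stack_file": "cells_01.tif", "channels": 3},
    {"id": 2, "type": "simulation", "solver": "ode45", "timestep": 0.01},
    {"id": 3, "type": "imaging", "stack_file": "cells_02.tif", "channels": 2},
]

def find(records, **criteria):
    """Minimal query-by-example, in the spirit of a document store's find()."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

imaging_runs = find(experiments, type="imaging")   # matches records 1 and 3
```

That said, the relational answer to blanks is usually several narrow tables rather than one wide one, so this is a trade-off (flexibility vs. enforced structure) rather than a strict win for either side.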

The data repositories suggested here are certainly useful (particularly OSF), 
but that brings up another issue I've been thinking about, which is 
discoverability. As an exemplar of the kind of solution to this problem I'm 
interested in, take a startup company I recently learned about called BenchSci 
<https://www.benchsci.com/>. Though they still have errors in the reported 
data, they are trying to solve a big problem in data discoverability regarding 
the use of antibodies in research. They're making a one-stop-shop where you can 
see vendor data and publication data for antibodies and targets, seriously 
reducing the leg work needed to hunt for all this information manually, and 
making it less likely that a good option will go undiscovered. Back to the more 
general data question: with so many repository options and so many formats, 
they all need to be tied together somehow. There should also be a way to 
incorporate 'legacy' data, i.e. data that's currently available only behind a 
paywall as a crappy JPG in supplemental figure 17, even though the high-res raw 
data might still exist on a hard drive somewhere and might be useful for some 
analysis not done in the original paper.

Obviously I'm starting to get a bit ahead of myself. I have a hard time not 
getting carried away.
------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma044d0880bb7896449f24aed