There is a lot of information here. Thanks to everyone offering insights! I'll offer some more context and detail for those who asked for my motivation for posting.
As a student in an academic science lab that uses computers and code to do science, I am interested in learning and adopting the tools and practices software developers use to write *good* code: code that's easy to share, easy for other lab members and collaborators to read and pick up, and resilient to my screw-ups. In other words, I'd like to do things *right*, even though I don't know of anyone else on campus who does. I'm also very motivated by the open science movement and want the tools needed to be a part of it. I've learned git (to a point), so that's cool, and now I'm trying to prod my lab mates and advisor into picking it up too.

That got me thinking, "Well, what about the data? I could just gitignore it all, but sometimes it changes, branches, and needs to be reset too. And it'd be great if I didn't have to track all that by file names." In my current case, I'm working with large (>100 MB) image stacks. Versioning in this sense would ideally look something like recording a macro of the operations performed (basically diffs) between one version and the next. Probably technically impossible, actually. My other data includes analysis and simulation output (.csv, .mat, etc.). That's where I was when I posted this question.

Currently, I'm foraying into transitioning from having all data organized alongside everything else in a file system to integrating it into databases. I'm new to the database universe, so forgive me any misunderstandings here. I'm averse to SQL because I'm certain a single table would have tons of blanks, and I don't like the idea of complicated joins. I'm a believer that all data should be dynamic; by that I mean I have a vague notion that any new (or really old) data should be able to be integrated into a data model to further inform the analysis. MongoDB strikes me as a useful tool for just about all of these scenarios.
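To make the sparse-table concern concrete, here's a minimal sketch using plain Python dicts, which have the same shape as the JSON-like documents MongoDB stores. All record and field names are invented for illustration: the point is that heterogeneous experiment types force a single SQL table to carry the union of every field, mostly blank per row, while a document model lets each record store only its own fields.

```python
# Two hypothetical lab records of different types, each with its own fields.
# In a document store (e.g. MongoDB), each is stored as-is; in one wide SQL
# table, every field becomes a column and most cells end up empty.
imaging_run = {
    "type": "imaging",
    "sample": "S-042",
    "stack_file": "stacks/S-042.tif",
    "z_slices": 120,
}

simulation_run = {
    "type": "simulation",
    "model": "diffusion-2d",
    "params": {"D": 1e-9, "steps": 10000},
    "output_csv": "sim/run_17.csv",
}

records = [imaging_run, simulation_run]

# Build the sparse-table view: columns are the union of all keys,
# with None where a record lacks that field.
columns = sorted({key for record in records for key in record})
table = [[record.get(col) for col in columns] for record in records]

blanks = sum(cell is None for row in table for cell in row)
print(columns)
print(f"{blanks} blank cells out of {len(records) * len(columns)}")
# → 6 blank cells out of 14
```

With a document model, a query can still match on shared keys (in MongoDB, something like `collection.find({"type": "imaging"})` via pymongo), without the schema having to anticipate every field up front.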
The data repositories suggested here are certainly useful (particularly OSF), but that raises another issue I've been thinking about: discoverability. As an example of the kind of solution I'm interested in, take a startup I recently learned about called BenchSci <https://www.benchsci.com/>. Though there are still errors in their reported data, they're trying to solve a big discoverability problem around the use of antibodies in research. They're building a one-stop shop where you can see vendor data and publication data for antibodies and their targets, seriously reducing the legwork needed to hunt down all this information manually and making it less likely that a good option goes undiscovered.

Back to the more general data question: with so many repository options and so many formats, they all need to be tied together somehow. There should also be a way to incorporate 'legacy' data, data that's currently only available behind a paywall as a crappy jpg in supplemental figure 17, even though the high-res raw data might still exist on a hard drive somewhere and might be useful for some analysis not done in the original paper. Obviously I'm starting to get a bit ahead of myself. I have a hard time not getting carried away.

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Tb776978a905c0bf8-Ma044d0880bb7896449f24aed
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
