This is a very hard (and interesting) problem, and I don't have much to say about it.
Instead I wanted to publicize a relevant tool I just found out about: Daff <http://paulfitz.github.io/daff/> is a diff for tabular data (CSVs, TSVs) that goes beyond line diffs: it does *line-by-column* diffs, and it integrates nicely with git. I've found it very helpful for updating metadata (which I do version control) on medium-sized projects.

Regarding your actual question, my current philosophy has been to never change raw data, and therefore not to version control it. Instead I back it up in a location away from my analysis (e.g. Dropbox works) and keep a retrieval script and an md5 hash of it with the project; a minimal sketch of that approach is appended below. In a perfect world I'd have a similar mechanism for saving computationally expensive intermediate data files (like the simulation results you describe).

-Byron

On Thu, Mar 3, 2016 at 10:49 AM, Matthew Gidden <[email protected]> wrote:

> Hi all,
>
> A few months ago I started a thread on the Data Carpentry forums about
> best practices for data management [1]. It got a small amount of traction
> and cited the recent work by a few on this list on good-enough practices
> [2]. [2] has a section called "Version Control for Data?", which lists some
> guiding principles but doesn't give concrete prescriptions. I'm hoping to
> hear how others choose to manage workflows for larger datasets, both raw
> and refined.
>
> To motivate the discussion, a short recap of [1]. Take for example a
> climate model. The model takes as input highly detailed spatial data and
> preprocesses it by aggregating to a coarser resolution (say this is binary
> data like a raster). Further, it takes as input different assumptions about
> the future, e.g., how GHG emissions change as a function of time and region
> of the world (say this is tabular data, like a CSV or SQL database). The
> model will be run for a variety of "scenarios" (assumptions about the
> future) and the results will be used by other modelers downstream. Finally,
> the internals of the model or the input data may be updated with some
> frequency and the results regenerated.
>
> Are there established best practices for how to manage and version this
> kind of workflow? Some mix of version control + DOIs seems like the best
> choice I can think of, but it inherently requires some amount of "rolling
> your own" -- not that that's a bad thing. Does anyone have any thoughts on
> how this kind of workflow scales in the input data, pre/post processing,
> and number of scenario dimensions?
>
> I'm looking forward to any feedback.
>
> Cheers,
> Matt
>
> [1] http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
> [2] http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/
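A minimal sketch of the retrieval-script-plus-md5-check idea described above. The URL, local path, and hash here are placeholders, not taken from any real project; adapt them to wherever your raw data is backed up.

```python
# Sketch of a "retrieval script + checksum" for raw data kept outside the repo.
# DATA_URL, LOCAL_PATH, and EXPECTED_MD5 are hypothetical placeholders.
import hashlib
import urllib.request
from pathlib import Path

DATA_URL = "https://example.com/backups/raw_data.csv"   # backup location (placeholder)
LOCAL_PATH = Path("data/raw_data.csv")                   # where the analysis expects it
EXPECTED_MD5 = "d41d8cd98f00b204e9800998ecf8427e"        # hash stored with the project


def md5sum(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def fetch_and_verify() -> None:
    """Download the raw data if it is missing, then check it against the recorded hash."""
    if not LOCAL_PATH.exists():
        LOCAL_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(DATA_URL, LOCAL_PATH)
    actual = md5sum(LOCAL_PATH)
    if actual != EXPECTED_MD5:
        raise RuntimeError(f"Checksum mismatch for {LOCAL_PATH}: got {actual}")
    print(f"{LOCAL_PATH} verified (md5 {actual})")


if __name__ == "__main__":
    fetch_and_verify()
```

The point is that only this small script and the hash live in version control; the raw data itself stays untouched in the backup location.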
