Hi all,

A few months ago I started a thread on the Data Carpentry forums about best practices for data management [1]. It got a modest amount of traction and pointed to the recent work by several people on this list on good-enough practices [2]. [2] has a section called "Version Control for Data?" that lists some guiding principles but stops short of concrete prescriptions. I'm hoping to hear how others choose to manage workflows for larger datasets, both raw and refined.
To motivate the discussion, here is a short recap of [1]. Take, for example, a climate model. The model takes as input highly detailed spatial data and preprocesses it by aggregating to a coarser resolution (say this is binary data, like a raster). It also takes as input different assumptions about the future, e.g., how GHG emissions change as a function of time and region of the world (say this is tabular data, like a CSV file or SQL database). The model is run for a variety of "scenarios" (sets of assumptions about the future), and the results are used by other modelers downstream. Finally, the internals of the model or the input data may be updated with some frequency and the results regenerated.

Are there established best practices for managing and versioning this kind of workflow? Some mix of version control + DOIs seems like the best choice I can think of, but it inherently requires some amount of "rolling your own" -- not that that's a bad thing. Does anyone have thoughts on how this kind of workflow scales with the input data, the pre-/post-processing, and the number of scenario dimensions?

I'm looking forward to any feedback.

Cheers,
Matt

[1] http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
[2] http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/
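
P.S. To make "rolling your own" a bit more concrete, below is the kind of minimal sketch I had in mind: for each scenario run, record a content hash of every input and output file in a small manifest that lives under version control next to the model code, so that a DOI minted for the results can point back to the exact input versions. The file names here are just placeholders, not a real dataset -- it's only meant to illustrate the idea, not to be a finished tool.

# Minimal "roll your own" provenance sketch: record a content hash and
# scenario label for each model run, so downstream users can tell exactly
# which inputs produced which outputs. File names below are hypothetical.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(inputs, scenario, outputs, manifest="manifest.json"):
    """Append a record describing one model run to a JSON manifest."""
    record = {
        "scenario": scenario,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": {str(p): sha256_of(Path(p)) for p in inputs},
        "outputs": {str(p): sha256_of(Path(p)) for p in outputs},
    }
    path = Path(manifest)
    records = json.loads(path.read_text()) if path.exists() else []
    records.append(record)
    path.write_text(json.dumps(records, indent=2))


# Example usage (hypothetical file names):
# write_manifest(["spatial_raster.tif", "emissions_scenario_a.csv"],
#                scenario="scenario_a",
#                outputs=["results_scenario_a.nc"])

A manifest like this, committed alongside the code, is the sort of bookkeeping I imagine a more established practice or tool would standardize -- hence the question about what others actually use.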
