Hi all,

A few months ago I started a thread on the Data Carpentry forums about best
practices for data management [1]. It got a small amount of traction and
cited recent work by a few people on this list on good-enough practices
[2]. [2] has a section called "Version Control for Data?", which lists some
guiding principles but doesn't offer concrete prescriptions. I'm hoping to
hear how others choose to manage workflows for larger datasets, both raw
and refined.

To motivate the discussion, a short recap of [1]. Take, for example, a
climate model. The model takes as input highly detailed spatial data and
preprocesses it by aggregating to a coarser resolution (say this is binary
data, like a raster). It also takes as input different assumptions about
the future, e.g., how GHG emissions change as a function of time and region
of the world (say this is tabular data, like a CSV file or SQL database).
The model will be run for a variety of "scenarios" (assumptions about the
future) and the results will be used by other modelers downstream. Finally,
the internals of the model or the input data may be updated with some
frequency and the results regenerated.
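
To make that concrete, here is a toy sketch in Python (purely illustrative;
the array shapes, file names, and CSV columns are things I'm making up) of
the kind of pipeline I have in mind:

    import csv
    import numpy as np

    def aggregate(fine, factor):
        """Coarsen a 2D raster by block-averaging factor x factor windows."""
        rows, cols = fine.shape
        trimmed = fine[:rows - rows % factor, :cols - cols % factor]
        return trimmed.reshape(rows // factor, factor,
                               cols // factor, factor).mean(axis=(1, 3))

    def load_scenarios(path):
        """Read scenario assumptions (e.g., emissions by year and region)."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Preprocessing: detailed spatial input -> coarser model grid.
    fine_grid = np.random.rand(1000, 1000)   # stand-in for the real raster
    coarse_grid = aggregate(fine_grid, factor=10)

    # The model would then be run once per scenario, roughly:
    #     for scenario in load_scenarios("scenarios.csv"):  # hypothetical file
    #         results = run_model(coarse_grid, scenario)
    #         # ...and the results handed off to downstream modelers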

Are there established best practices for how to manage and version this
kind of workflow? Some mix of version control + DOIs seems like the best
option I can think of, but it inherently requires some amount of "rolling
your own" -- not that that's a bad thing. Does anyone have thoughts on how
this kind of workflow scales with the size of the input data, the amount of
pre/post-processing, and the number of scenario dimensions?
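
To give a sense of what I mean by "rolling your own" (this is only a
sketch; the paths, the choice of SHA-256, and the manifest format are just
what I happen to be imagining), each scenario run could write out a small
provenance record tying together the code version, the input checksums, and
the scenario name, with a DOI added once the archived outputs are deposited
somewhere like Zenodo:

    import hashlib
    import json
    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    def sha256(path):
        """Checksum a (possibly large) input file in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def git_commit():
        """Record the exact code version (assumes the model lives in git)."""
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()

    def write_manifest(scenario, inputs, out_dir):
        """Write a provenance record next to this scenario's results."""
        manifest = {
            "scenario": scenario,
            "code_commit": git_commit(),
            "inputs": {str(p): sha256(p) for p in inputs},
            "run_at": datetime.now(timezone.utc).isoformat(),
            "doi": None,  # filled in after depositing the archived outputs
        }
        out = Path(out_dir) / "manifest.json"
        out.write_text(json.dumps(manifest, indent=2))
        return out

    # Hypothetical usage for one scenario run:
    # write_manifest("rcp45",
    #                ["inputs/landuse.tif", "inputs/emissions_rcp45.csv"],
    #                "results/rcp45")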

I'm looking forward to any feedback.

Cheers,
Matt

[1] http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
[2] http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/