This is a very hard (and interesting) problem, and I don't have much to say
about it.

Instead I wanted to publicize a relevant tool I just found out about:

Daff <http://paulfitz.github.io/daff/> is a diff for tabular data (CSVs,
TSVs) that goes beyond line diffs: it does line-by-column (i.e., cell-level)
diffs, and it integrates nicely into git.  I've found it very helpful for
updating metadata (which I do version control) on medium-sized projects.
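To give a flavor of what a cell-level diff buys you over a plain line diff,
here's a rough sketch using only Python's standard library -- this is not
daff's API, just the idea, and it assumes the two CSVs share a key column
(hypothetically named "id"); daff works out the alignment for you:

import csv

def load(path, key="id"):
    # Read a CSV into {key value: row dict}; "id" is a hypothetical key column.
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def cell_diff(old_path, new_path, key="id"):
    # Report per-cell changes instead of whole changed lines.
    old, new = load(old_path, key), load(new_path, key)
    for k in sorted(old.keys() & new.keys()):
        for col, old_val in old[k].items():
            if col in new[k] and old_val != new[k][col]:
                print(f"row {k}, column {col}: {old_val!r} -> {new[k][col]!r}")
    for k in sorted(old.keys() - new.keys()):
        print(f"row {k}: removed")
    for k in sorted(new.keys() - old.keys()):
        print(f"row {k}: added")

cell_diff("metadata_old.csv", "metadata_new.csv")  # hypothetical file names

Daff does this much more carefully and renders the result nicely, but that's
the gist of why a tabular diff beats `git diff` on a CSV.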

Regarding your actual question, my current philosophy has been to never
change raw data, and therefore not to version control it.  Instead I back
it up in a location away from my analysis (e.g. Dropbox works) and keep a
retrieval script and an md5 hash of it with the project.  In a perfect
world I'd have some similar mechanism for saving computationally expensive
intermediate data files (like the simulation results you describe).
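Concretely, the retrieval script can double as the integrity check.  A
minimal sketch of what I mean (the URL, file name, and hash below are
placeholders, not real values):

import hashlib
import urllib.request

RAW_URL = "https://example.org/backups/raw_data.csv"  # wherever the backup lives
RAW_FILE = "raw_data.csv"
EXPECTED_MD5 = "replace-with-the-recorded-md5-hash"

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large raw files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_raw_data():
    urllib.request.urlretrieve(RAW_URL, RAW_FILE)
    actual = md5sum(RAW_FILE)
    if actual != EXPECTED_MD5:
        raise RuntimeError(f"checksum mismatch: expected {EXPECTED_MD5}, got {actual}")
    print(f"retrieved and verified {RAW_FILE}")

if __name__ == "__main__":
    fetch_raw_data()

The script and the hash live in version control with the project, so anyone
can re-fetch the raw data later and confirm it's bit-for-bit the file the
analysis was built on; the same pattern would work for caching expensive
intermediate results.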

-Byron

On Thu, Mar 3, 2016 at 10:49 AM, Matthew Gidden <[email protected]>
wrote:

> Hi all,
>
> A few months ago I started a thread on the Data Carpentry forums about
> best practices for data management [1]. It got a small amount of traction
> and cited the recent work by a few on this list on good-enough practices
> [2]. [2] has a section called "Version Control for Data?", which lists some
> guiding principles, but doesn't have concrete prescriptions. I'm hoping to
> hear from others how they choose to manage workflows for larger datasets,
> both raw and refined.
>
> To motivate the discussion, a short recap of [1]. Take for example a
> climate model. The model takes as input highly detailed spatial data and
> preprocesses it by aggregating to a coarser resolution (say this is binary
> data like a raster). Further, it takes as input different assumptions about
> the future, e.g., how GHG emissions change as a function of time and region
> of the world (say this is tabular data, like a CSV or SQL database). The
> model will be run for a variety of "scenarios" (assumptions about the
> future) and the results will be used by other modelers downstream. Finally,
> the internals of the model or the input data may be updated with some
> frequency and results regenerated.
>
> Are there established best practices for how to manage and version this
> kind of workflow? Some mix of version control + DOIs seems like the best
> choice I can think of, but this inherently requires some amount of "rolling
> your own" -- not that that's a bad thing. Does anyone have any thoughts for
> how this kind of workflow scales in the input data, pre/post processing,
> and number of scenario dimensions?
>
> I'm looking forward to any feedback.
>
> Cheers,
> Matt
>
> [1]
> http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
> [2]
> http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/
>
