I wanted to contribute this to the other thread on reproducibility, but this is a better place for it.

I've tried a bunch of not-quite solutions to this: storing data files with git-fat or git-annex, build/pipeline systems like make, Snakemake, and Sumatra. I've also tried a couple of times to roll my own by storing hashes of data files and the repo state as metadata about results.

For me, the real problem isn't version control but provenance tracking: I don't particularly care about past versions of the data, but given a particular result I need to know what versions of intermediate data, code, and libraries were used to create it. In all the solutions I've seen before, this requires a layer of user discipline on top of the tools themselves, e.g. don't run any analysis with an uncommitted repo (but what if it takes multiple days and I'd like to be working on something else in the meantime?). And this is further complicated by the fact that much of my data (and some of my code!) lives in databases.

If anyone's figured this out, I would love to know. I would also be willing to consider some compromise in my workflow if it would allow something that already exists to work for me (e.g. no databases, no concurrent analyses, or everything on one machine).

My current line of thought is this: if every result is produced by some function that is deterministic given its inputs, it should be sufficient to store a mapping of the inputs to the result (as in memoization). The trick is specifying the function's arguments and dependencies in enough detail, especially when these include input data (in databases!). A rough sketch of what I mean is at the bottom of this message, below the quoted thread.

Cheers,
Adam

On Thu, Mar 3, 2016, at 11:53 AM, Byron Smith wrote:
> This is a very hard (and interesting) problem, and I don't have much
> to say about it.
>
> Instead I wanted to publicize a relevant tool I just found out about:
> Daff[1] is a diff for tabular data (CSVs, TSVs) that goes beyond line
> diffs (it does *line-by-column* diffs) and integrates nicely into git.
> I've found it very helpful in updating metadata (which I do version
> control) for medium-sized projects.
>
> Regarding your actual question, my current philosophy has been to
> never change raw data, and therefore not to version control it.
> Instead I back it up in a location away from my analysis (e.g. Dropbox
> works) and keep a retrieval script and an md5 hash of it with the
> project. In a perfect world I'd have some similar mechanism for
> saving computationally expensive intermediate data files (like the
> simulation results you describe).
>
> -Byron
>
> On Thu, Mar 3, 2016 at 10:49 AM, Matthew Gidden
> <[email protected]> wrote:
>> Hi all,
>>
>> A few months ago I started a thread on the Data Carpentry forums
>> about best practices for data management [1]. It got a small amount
>> of traction and cited the recent work by a few on this list on
>> good-enough practices [2]. [2] has a section called "Version Control
>> for Data?", which lists some guiding principles, but doesn't have
>> concrete prescriptions. I'm hoping to hear from others how they
>> choose to manage workflows for larger datasets, both raw and refined.
>>
>> To motivate the discussion, a short recap of [1]. Take, for example,
>> a climate model. The model takes as input highly detailed spatial
>> data and preprocesses it by aggregating to a coarser resolution (say
>> this is binary data, like a raster). Further, it takes as input
>> different assumptions about the future, e.g., how GHG emissions
>> change as a function of time and region of the world (say this is
>> tabular data, like a csv or sql database).
>> The model will be run for a variety of "scenarios" (assumptions
>> about the future) and the results will be used by other modelers
>> downstream. Finally, the internals of the model or the input data
>> may be updated with some frequency and results regenerated. Are
>> there established best practices for how to manage and version this
>> kind of workflow? Some mix of version control + DOIs seems like the
>> best choice I can think of, but this inherently requires some amount
>> of "rolling your own" -- not that that's a bad thing. Does anyone
>> have any thoughts on how this kind of workflow scales in the input
>> data, pre/post processing, and number of scenario dimensions? I'm
>> looking forward to any feedback.
>>
>> Cheers,
>> Matt
>>
>> [1] http://discuss.datacarpentry.org/t/data-management-best-practices-resource/67/6
>> [2] http://swcarpentry.github.io/good-enough-practices-in-scientific-computing/

Links:
1. http://paulfitz.github.io/daff/
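
Here is a very rough, untested Python sketch of the memoization idea above. Everything in it is made up for illustration rather than an existing tool: the "provenance" decorator, the cache layout, and the rule of treating Path arguments as input files are all assumptions. Data living in a database would need its own rule for turning "the input" into something hashable (e.g. a query string plus a table version), which is exactly the part I don't know how to do well.

    # Illustrative sketch only -- none of these names refer to an existing tool.
    # Idea: key every cached result on (function identity, code version,
    # arguments, and content hashes of any input data files).
    import functools
    import hashlib
    import json
    import pickle
    import subprocess
    from pathlib import Path

    CACHE = Path("provenance_cache")
    CACHE.mkdir(exist_ok=True)

    def file_digest(path):
        # The exact bytes of an input file become part of the cache key.
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def repo_state():
        # Record the code version: current commit plus a "dirty" flag, so
        # results produced from an uncommitted tree are at least marked as such.
        commit = subprocess.run(["git", "rev-parse", "HEAD"],
                                capture_output=True, text=True).stdout.strip()
        dirty = bool(subprocess.run(["git", "status", "--porcelain"],
                                    capture_output=True, text=True).stdout.strip())
        return {"commit": commit, "dirty": dirty}

    def describe(value):
        # Path arguments are treated as input data files and hashed by content;
        # everything else is described by its repr().
        if isinstance(value, Path):
            return {"path": str(value), "sha256": file_digest(value)}
        return repr(value)

    def provenance(func):
        # Memoize func on a key that captures code, arguments, and input data.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key_material = {
                "function": f"{func.__module__}.{func.__qualname__}",
                "code": repo_state(),
                "args": [describe(a) for a in args],
                "kwargs": {k: describe(v) for k, v in sorted(kwargs.items())},
            }
            key = hashlib.sha256(
                json.dumps(key_material, sort_keys=True).encode()).hexdigest()
            result_path = CACHE / f"{key}.pkl"
            if result_path.exists():
                return pickle.loads(result_path.read_bytes())
            result = func(*args, **kwargs)
            result_path.write_bytes(pickle.dumps(result))
            (CACHE / f"{key}.json").write_text(json.dumps(key_material, indent=2))
            return result
        return wrapper

    @provenance
    def coarsen(raster: Path, factor: int = 4):
        # Placeholder for an expensive preprocessing step.
        ...

The key then changes whenever the commit, the arguments, or the bytes of a Path argument change, and the .json file sitting next to each cached result answers the "what produced this?" question. It doesn't solve the dirty-tree or database problems; it just records them.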
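And for completeness, my reading of Byron's retrieve-and-verify setup for raw data, again as an untested sketch; the URL, file name, and checksum below are placeholders, not anything from his project.

    # Placeholder values throughout; the md5 below is not a real checksum.
    import hashlib
    import urllib.request
    from pathlib import Path

    RAW = Path("data/raw/emissions.csv")                      # not under version control
    SOURCE_URL = "https://example.org/archive/emissions.csv"  # e.g. a Dropbox link
    EXPECTED_MD5 = "0123456789abcdef0123456789abcdef"         # this *is* committed

    def md5sum(path):
        return hashlib.md5(path.read_bytes()).hexdigest()

    def fetch_raw_data():
        # Download the raw file if it is missing, then refuse to proceed
        # unless it matches the checksum recorded in the repository.
        if not RAW.exists():
            RAW.parent.mkdir(parents=True, exist_ok=True)
            urllib.request.urlretrieve(SOURCE_URL, RAW)
        if md5sum(RAW) != EXPECTED_MD5:
            raise RuntimeError(f"{RAW} does not match the recorded md5; "
                               "the raw data changed or the download failed")

    if __name__ == "__main__":
        fetch_raw_data()

The same pattern would presumably extend to the expensive intermediate files he mentions, with the script regenerating (rather than downloading) anything whose recorded hash is missing or stale.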
