Hi Greg,

I've been doing a bunch of work where we really want to track which versions
of things were used, the basic methods, and what input data was used.

What I ended up implementing was a bunch of JSON metadata.

When the raw files are copied, a JSON metadata file gets generated with the
original location, new location, and the SHA256 checksum of the original data.
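
As a rough illustration of the copy step, here is a minimal Python sketch. It
is not the actual implementation; the function names `sha256_of` and
`copy_with_metadata`, the JSON key names, and the `.meta.json` sidecar naming
are all assumptions made for the example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_metadata(src, dst):
    """Copy src to dst and write a JSON sidecar describing the copy.

    The sidecar name (dst + '.meta.json') and the key names are
    hypothetical choices, not taken from the original workflow.
    """
    src, dst = Path(src), Path(dst)
    checksum = sha256_of(src)  # checksum of the original data
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    metadata = {
        "original_location": str(src.resolve()),
        "new_location": str(dst.resolve()),
        "sha256": checksum,
    }
    meta_path = dst.with_name(dst.name + ".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2))
    return meta_path
```

Computing the checksum before the copy means the recorded value describes the
original file, so the copy can be verified against it later.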

After a transformation step, more metadata is extracted during processing,
including which version of the software did the conversion. This gets added
to the raw copying metadata.
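
Adding a transformation record to the existing metadata could look something
like the Python sketch below. The function name `append_step`, the `steps`
key, and the field names are assumptions for illustration only.

```python
import json
from pathlib import Path


def append_step(meta_path, step_name, software, version, extra=None):
    """Append one processing-step record to an existing JSON metadata file.

    All key names here ('steps', 'step', 'software', 'version') are
    hypothetical; use whatever schema the metadata already follows.
    """
    meta_path = Path(meta_path)
    metadata = json.loads(meta_path.read_text())
    record = {"step": step_name, "software": software, "version": version}
    if extra:
        # Anything else extracted during processing goes in alongside.
        record.update(extra)
    metadata.setdefault("steps", []).append(record)
    meta_path.write_text(json.dumps(metadata, indent=2))
    return metadata
```

Appending to a list keeps the earlier copy metadata untouched, so the file
ends up as a history of everything done to the data.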

A final processing step uses a custom R package, and for that package,
I have a git hook that increments the package version on every commit, so
every commit has a corresponding version number. Also, because it is a
local custom pkg, if I have a clean git repo, the git SHA is added to the
pkg metadata at install. During processing, I have a function that gets the
parent pkg metadata, including the SHA if it exists, and adds it to another
JSON metadata file. It also takes the main data class object that gets
processed, strips the data bits, and writes a JSON representation of the
main class (so all the methods are encoded as JSON), which also becomes
part of the JSON metadata, in addition to saving a binary representation of
the class with the data attached. All this is added to the previously
existing metadata.
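
The "strip the data, keep the methods" idea can be sketched in Python with a
dataclass. The real code is an R package, so this is only an analogy: the
`Analysis` class, its fields, and `methods_as_json` are hypothetical names
invented for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Analysis:
    """Hypothetical stand-in for the main data class."""
    normalization: str
    filter_threshold: float
    package_version: str
    git_sha: Optional[str] = None  # only set when installed from a clean repo
    data: list = field(default_factory=list)


def methods_as_json(obj):
    """Return a JSON description of obj with the data removed."""
    record = asdict(obj)      # deep copy, so obj itself is not modified
    record.pop("data", None)  # strip the data bits, keep the settings
    return json.dumps(record, indent=2, sort_keys=True)
```

The JSON string can then be merged into the existing metadata file, while the
full object (data included) is saved separately in a binary format.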

Ideally, I would also capture the version numbers of all the packages that
my code imports, but I haven't gone that far.
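
For comparison, recording dependency versions is fairly short in Python using
the standard library's `importlib.metadata` (Python 3.8+). This is a sketch of
the idea, not an R solution; `dependency_versions` is a hypothetical helper
name.

```python
from importlib import metadata


def dependency_versions(names):
    """Map each distribution name to its installed version.

    Packages that are not installed are recorded as None rather than
    raising, so the metadata still lists what was looked for.
    """
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

The resulting dictionary can be dumped into the same JSON metadata as the
package version and git SHA.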

Cheers,

-Robert

On Sun, Aug 12, 2018 at 7:40 PM Damien Irving via discuss <
[email protected]> wrote:

> Hi Greg,
>
> I've written a Data Carpentry lesson on data provenance, which makes use
> of a very simple package I've written called cmdline-provenance:
>
>    - Lesson:
>    https://data-lessons.github.io/python-aos-lesson/09-provenance/index.html
>    - Package: http://cmdline-provenance.readthedocs.io/en/latest/
>
>
> Cheers,
> Damien
>
> On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson <[email protected]>
> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id:$' in every source code file - Subversion would
>> automatically fill in the revision ID on every commit to turn it into
>> something like '$Id: 12345'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance?  What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings and
>> parameters?  I was excited about the Open Provenance effort circa 2006-07 (
>> https://openprovenance.org/opm/), but it never seemed to catch on.  What
>> are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
>>
>>
>> ------------------------------------------
>> The Carpentries: discuss
>> Permalink:
>> https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
>> Delivery options:
>> https://carpentries.topicbox.com/groups/discuss/subscription
>>

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M5aa3b948e50783c4ea461204
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
