Hi Greg, I've been doing a bunch of work where we really want to track which versions of things were used, the basic methods, and what input data was used.
What I ended up implementing was a bunch of JSON metadata. When the raw files are copied, a JSON metadata file gets generated with the original location, the new location, and the SHA256 checksum of the original data. After a transformation step, more metadata is extracted during processing, including which version of the software did the conversion. This gets added to the raw copying metadata.

Final processing is done with a custom R package, and for that package I have a git hook that increments the package version on every commit, so every commit has a corresponding version number. Also, because it is a local custom package, if I have a clean git repo, the git SHA is added to the package metadata at install. During processing, I have a function that gets the parent package metadata, including the SHA if it exists, and adds it to another JSON metadata file. It also takes the main data class object that gets processed, strips the data bits, and writes a JSON representation of the main class (so all the methods are encoded as JSON), which also becomes part of the JSON metadata, in addition to saving a binary representation of the class with the data attached. All this is added to the previously existing metadata.

Ideally, I would be capturing the version numbers for all of the packages that my code imports as well, but I haven't gone that far.
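For anyone who wants to try something similar, here is a minimal sketch of the same idea in Python (the workflow described above uses R, so this is a stand-in rather than the actual implementation). All function names, the `.json` sidecar naming, and the field names such as `steps` are hypothetical choices for illustration. It covers the raw-copy metadata with a SHA256 checksum, appending a later processing step with its software version, recording the git SHA only when the repo is clean, and recording versions of imported packages.

```python
import hashlib
import json
import shutil
import subprocess
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_metadata(source, destination):
    """Copy a raw file and write a JSON sidecar describing the copy."""
    source, destination = Path(source), Path(destination)
    checksum = sha256_of(source)  # checksum of the original data
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, destination)
    record = {
        "original_location": str(source),
        "new_location": str(destination),
        "sha256": checksum,
        "steps": [],  # later processing steps get appended here
    }
    # Hypothetical naming convention: data.bin -> data.bin.json
    sidecar = destination.with_name(destination.name + ".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


def clean_git_sha(repo_dir):
    """Return the HEAD commit SHA if the repo is clean, else None."""
    try:
        status = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        if status.stdout.strip():
            return None  # uncommitted changes: the SHA would not describe the code
        head = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return head.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, or repo_dir is not a repository


def append_step(sidecar, step_name, software, version, git_sha=None, packages=()):
    """Add one processing step to an existing JSON sidecar file."""
    sidecar = Path(sidecar)
    record = json.loads(sidecar.read_text())
    package_versions = {}
    for name in packages:
        try:
            package_versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            package_versions[name] = None  # not installed in this environment
    record["steps"].append({
        "step": step_name,
        "software": software,
        "version": version,
        "git_sha": git_sha,
        "packages": package_versions,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

A typical call sequence would be `sidecar = copy_with_metadata(raw_path, working_path)` followed by `append_step(sidecar, "convert", "some-converter", "1.2.3", git_sha=clean_git_sha("."))`, where the converter name and version are placeholders. Keeping everything in one sidecar file per data file means the provenance travels with the data, at the cost of having to rewrite the file at each step.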
Cheers,

-Robert

On Sun, Aug 12, 2018 at 7:40 PM Damien Irving via discuss <[email protected]> wrote:

> Hi Greg,
>
> I've written a Data Carpentry lesson on data provenance, which makes use
> of a very simple package I've written called cmdline-provenance:
>
> - Lesson: https://data-lessons.github.io/python-aos-lesson/09-provenance/index.html
> - Package: http://cmdline-provenance.readthedocs.io/en/latest/
>
> Cheers,
> Damien
>
> On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id:$' in every source code file - Subversion would
>> automatically fill in the revision ID on every commit to turn it into
>> something like '$Id: 12345'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance? What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings and
>> parameters? I was excited about the Open Provenance effort circa 2006-07
>> (https://openprovenance.org/opm/), but it never seemed to catch on. What
>> are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M5aa3b948e50783c4ea461204
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
