I remember playing with Sumatra several years ago. I believe the approach is to 
track all that metadata in a SQLite db and then make it browsable/accessible 
with a Django web app. 

http://neuralensemble.org/sumatra/

In the R world, many folks have taken to appending `sessionInfo()` or 
`devtools::session_info()` to the end of an Rmd file to track attached 
packages, etc. The latter also gives SHAs for packages installed from GitHub. 
It wouldn't be that hard to also include a shell chunk with `git rev-parse 
HEAD` to capture the local repo's commit.
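The same habit is easy to automate outside R, too. Here's a minimal Python sketch (a hypothetical helper, not anything from this thread) that builds an output-file header from the script name, a UTC timestamp, and `git rev-parse HEAD`, falling back to "unknown" when run outside a repository:

```python
import subprocess
import sys
from datetime import datetime, timezone

def provenance_header(comment="#"):
    """Return a small provenance header: script name, current git
    commit (or 'unknown' outside a repo), and a UTC timestamp."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    fields = (
        f"script: {sys.argv[0]}",
        f"commit: {commit}",
        f"date: {stamp}",
    )
    # Prefix each field with the output format's comment character
    # so the header can sit at the top of a CSV, script, etc.
    return "\n".join(f"{comment} {line}" for line in fields)

print(provenance_header())
```

Writing that string at the top of every generated file gets you most of what the old Subversion `$Id$` trick did, without depending on the version control system to expand keywords.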

Here’s the old discussion on this I remember from several years ago:
https://github.com/swcarpentry/DEPRECATED-site/issues/1085

Best,
Naupaka

> On Aug 12, 2018, at 6:30 AM, Bruce Becker via discuss 
> <[email protected]> wrote:
> 
> Hi Greg, all
> I'm not sure about the Bronze Age, but in the Baroque era my understanding is 
> that this is the job of metadata. You need a lot of machinery to do this, but 
> in this era, data never lives "nakedly" but is always accompanied by 
> metadata which describes it. So you look up data in repositories by its 
> persistent identifier, and deposit it there along with its changelog or 
> whatever.
> 
> I am the first to concede that many, if not the vast majority of, data 
> civilisations will never reach the Baroque age - and perhaps others will skip 
> it altogether, but this happens to be the civilisation I'm writing to you 
> from. I'd hazard the suggestion that the Baroque Age is also known as the 
> Open Science age, just to be prickly.
> 
> Have a great Sunday!
> Bruce
> 
>> On Sun, 12 Aug 2018 at 15:15, Greg Wilson <[email protected]> wrote:
>> Hi,
>> 
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>> 
>> - Include the string '$Id$' in every source code file; Subversion
>> would automatically fill in the revision ID on every commit to turn it
>> into something like '$Id: 12345 $'.
>> 
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>> 
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>> 
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance?  What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings
>> and parameters?  I was excited about the Open Provenance effort circa
>> 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
>> on.  What are people using instead?
>> 
>> Thanks,
>> 
>> Greg
>> 
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
>> 
> 

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-Maa4849e5f43ef8009e5f87e0
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
