I just kicked the tires on DVC, and it provides a nice mechanism for
manually tracking inputs and outputs alongside code files, with
everything hashed and recorded in YAML metadata files.
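To make that concrete, here is a rough sketch of the YAML metadata DVC
writes (a hypothetical example; the file names, hash, and size below are
placeholders, not real values). `dvc add data/raw.csv` drops a small
`data/raw.csv.dvc` file next to the data, and pipeline stages live in a
`dvc.yaml`:

```yaml
# data/raw.csv.dvc -- written by `dvc add data/raw.csv`
# (the md5 and size below are placeholders, not real values)
outs:
- md5: 0123456789abcdef0123456789abcdef
  size: 14445097
  path: raw.csv
---
# dvc.yaml -- a pipeline stage tying code, inputs, and outputs together
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
    - prepare.py
    - data/raw.csv
    outs:
    - data/clean.csv
```

DVC re-hashes the deps and outs to decide whether a stage needs to rerun,
which is what gives you the provenance trail.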

If you want something automatic, the Gigantum client hooks into Jupyter's
notification system and records the version, the contents of the code
cell, and the outputs, including thumbnails for images (for R and Python
at least). Since we track the environment and use Docker, you should get
exceptional reproducibility. Maybe we are actually in the early
industrial age? I'm reminded a bit of the ballad of John Henry...

I've mentioned it a few times before, but there hasn't yet been any
interest on the list, which is a bummer.

But others have reported that it's very easy to get started, and you can
find info on trying the demo server or installing the open source client
here:

http://gigantum.com

I'm happy to provide more pointers to our activity record format, but the
easiest way to get started would be to look at an example and browse the
activity tab in the client. The client is a completely different web
interface from Jupyter and is meant to generally replace what you'd do on
the command line. It's still just Git and Docker under the hood, and the
activity record is stored in Berkeley DB IIRC.

I'm happy to help folks dig in, and field any questions or criticism.

Best,
D

On Sun, Aug 12, 2018, 9:56 AM naupaka via discuss <
[email protected]> wrote:

> I remember playing with Sumatra several years ago. I believe the approach
> is to track all that metadata in a SQLite db and then make it
> browsable/accessible with a Django web app.
>
> http://neuralensemble.org/sumatra/
>
> In the R world many folks have taken to appending `sessionInfo()` or
> `devtools::session_info()` to the end of an Rmd file to track packages
> attached, etc. The latter also gives SHAs for packages installed from
> GitHub. Wouldn’t be that hard to also start including a shell chunk with
> `git rev-parse HEAD` to include the local repo commit info.
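The shell chunk described above would amount to something like this (a
minimal sketch; the fallback messages are my own addition so the snippet
also runs outside a checkout):

```shell
# Record which commit of the local repository produced this document.
git rev-parse HEAD 2>/dev/null || echo "not inside a git repository"
# Abbreviated form, if you prefer short SHAs:
git rev-parse --short HEAD 2>/dev/null || true
```

Dropping that in a chunk next to `devtools::session_info()` pins both the
package versions and the state of the analysis code itself.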
>
> Here’s the old discussion on this I remember from several years ago:
> https://github.com/swcarpentry/DEPRECATED-site/issues/1085
>
> Best,
> Naupaka
>
> On Aug 12, 2018, at 6:30 AM, Bruce Becker via discuss <
> [email protected]> wrote:
>
> Hi Greg, all
> I'm not sure about the Bronze Age, but in the Baroque era my understanding
> is that this is the job of metadata. You need a lot of machinery to do
> this, but in this era data never lives "nakedly"; it is always
> accompanied by metadata which describes it. So you look up data by its
> persistent identifier in repositories, and deposit it there along with
> its changelog or whatever.
>
> I am the first to concede that many, if not the vast majority, of data
> civilisations will never reach the Baroque age - and perhaps others will
> skip it altogether - but this happens to be the civilisation I'm writing
> to you from. I'd hazard the suggestion that the Baroque Age is also known
> as the Open Science age, just to be prickly.
>
> Have a great Sunday!
> Bruce
>
> On Sun, 12 Aug 2018 at 15:15, Greg Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id$' in every source code file - Subversion
>> would automatically fill in the revision ID on every commit to turn it
>> into something like '$Id: 12345 $'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
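A minimal sketch of that second step, assuming a shell script (the
output path, the `threshold` parameter, and the `unknown` fallback are
my own hypothetical stand-ins, not Software Carpentry's original code):

```shell
# Write a provenance header -- script name, commit ID, date, parameters --
# at the top of the output file. 'unknown' is a fallback for runs outside
# a git checkout; threshold stands in for the script's real parameters.
commit=$(git rev-parse HEAD 2>/dev/null || echo unknown)
{
  echo "# script: $0"
  echo "# commit: $commit"
  echo "# date:   $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "# param:  threshold=0.05"
} > results.csv
cat results.csv
```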
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance?  What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings
>> and parameters?  I was excited about the Open Provenance effort circa
>> 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
>> on.  What are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
>>
>>
>> ------------------------------------------
>> The Carpentries: discuss
>> Permalink:
>> https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
>> Delivery options:
>> https://carpentries.topicbox.com/groups/discuss/subscription
>>
>
