Re: [discuss] tracking data provenance

Robin Wilson Sun, 12 Aug 2018 07:56:29 -0700

 Hi,
 
 A few years ago, some colleagues and I created a tool called recipy 
(https://github.com/recipy/recipy) that automatically collects provenance 
information from Python code. Basically, you add a single line ('import recipy') 
to the top of your code, and all inputs and outputs, along with code version and 
package information, are automatically tracked in a database. The database can be 
interrogated through a command-line interface or a web-based GUI. It was 
created to be the 'magic', 'no-effort' solution that everyone wants: 
people don't have to think about this stuff, but it's always recorded.
 
 recipy works by hooking into Python's package importing mechanism, and then 
patching packages to make them call recipy logging functions before they do 
input/output. Currently we have support for some of the most common Python 
libraries for data processing including numpy, pandas, matplotlib, 
BeautifulSoup, lxml, GDAL, nibabel and others.
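
 As a rough illustration of the patching step only (this is not recipy's 
actual code, and it leaves out the import hook entirely), wrapping a library's 
I/O functions so they log before doing their work could look something like 
the sketch below. The names 'toylib', 'patch_function' and 'provenance_log' 
are made up for the example, and the log is a plain list rather than a database.

```python
import functools
import types

# Hypothetical in-memory log, used here only to keep the sketch self-contained.
provenance_log = []


def patch_function(module, func_name, kind):
    """Replace module.func_name with a wrapper that logs its first argument."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(path, *args, **kwargs):
        # Record the access before the real input/output happens.
        provenance_log.append((kind, f"{module.__name__}.{func_name}", str(path)))
        return original(path, *args, **kwargs)

    setattr(module, func_name, wrapper)


# Stand-in for a data library, so the sketch does not touch a real package.
toylib = types.ModuleType("toylib")
toylib.load = lambda path: f"contents of {path}"
toylib.save = lambda path, data: None

patch_function(toylib, "load", "input")
patch_function(toylib, "save", "output")

# The calling code is unchanged, but every call now leaves a record.
data = toylib.load("raw.csv")
toylib.save("clean.csv", data)
```

 After running this, 'provenance_log' holds one entry per call, recording 
whether it was an input or an output, which function was used, and the path.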
 
 The first version of this code was created at the Software Sustainability 
Institute Collaborations Workshop a few years ago, and it was then presented at 
EuroSciPy 2015 (see https://www.youtube.com/watch?v=8tysix6Olgc&t=13s). Since 
then there has been some development on it, but work has slowed down 
significantly recently because of my poor health and the other authors' 
other commitments. I'm now in a situation where I'd like to work on it but don't 
have any funding. If anyone is interested in this being developed further and 
wants to contribute some development effort, or even some funding, then please 
get in touch!
 
 Best wishes,
 
 Robin
 On 12 August 2018 at 14:14:08, Greg Wilson ([email protected]) wrote:
 
 Hi,
 
 Back in the Stone Age, Software Carpentry's lessons spent a few minutes
 discussing data provenance:
 
 - Include the string '$Id:$' in every source code file - Subversion
 would automatically fill in the revision ID on every commit to turn it
 into something like '$Id: 12345 $'.
 
 - Print the script's name, the commit ID, and the date in the header of
 every output file (along with all the parameters used by the script).
 
 It wasn't much, and I don't know how many people ever actually
 implemented it, but it did allow you to keep track of which versions of
 which scripts had generated which output files in a systematic way.
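
 For anyone who wants to try the header convention described above, a
 minimal Python sketch might look like the following. It assumes the code
 lives in a Git repository rather than Subversion, and the function names,
 the '# key: value' layout and the example parameters are all invented for
 illustration.

```python
import datetime
import subprocess
import sys


def current_commit():
    """Return the current Git commit ID, or 'unknown' if it cannot be found."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a repository.
        return "unknown"


def provenance_header(script, commit, when, params):
    """Build comment lines to put at the top of an output file."""
    lines = [
        f"# script: {script}",
        f"# commit: {commit}",
        f"# date: {when}",
    ]
    # Sort the parameters so the header is the same from run to run.
    lines += [f"# param {name}: {value}" for name, value in sorted(params.items())]
    return "\n".join(lines) + "\n"


header = provenance_header(
    script=sys.argv[0],
    commit=current_commit(),
    when=datetime.datetime.now().isoformat(timespec="seconds"),
    params={"threshold": 0.5, "input": "raw.csv"},  # example values only
)
```

 Writing 'header' to each output file before the data gives every file a
 record of the script, commit, date and parameters that produced it.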
 
 So here we are today in what I hope is research computing's Bronze Age,
 and I'm curious: what do you all actually do to keep track of data
 provenance?  What tools or methods do you use to record which programs
 produced which output files from which input files with which settings
 and parameters?  I was excited about the Open Provenance effort circa
 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
 on.  What are people using instead?
 
 Thanks,
 
 Greg
 
 --
 If you cannot be brave – and it is often hard to be brave – be kind.


------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M3cadb9b59850be00c755d516
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
