Hi,

A few years ago, some colleagues and I created a tool called recipy (https://github.com/recipy/recipy) that automatically collects provenance information from Python code. Basically, you add a single line ('import recipy') to the top of your code, and all inputs and outputs, along with code version and package information, are automatically tracked in a database. The database can be interrogated through a command-line interface or a web-based GUI. It was created to be the 'magic', 'no-effort' solution that everyone wants: people don't have to think about this stuff, but it's always recorded.

recipy works by hooking into Python's package importing mechanism and then patching packages so that they call recipy's logging functions before they do input/output. Currently we support some of the most common Python libraries for data processing, including numpy, pandas, matplotlib, BeautifulSoup, lxml, GDAL, nibabel and others.

The first version of this code was created at the Software Sustainability Institute Collaborations Workshop a few years ago, and it was then presented at EuroSciPy 2015 (see https://www.youtube.com/watch?v=8tysix6Olgc&t=13s). Since then there has been some development on it, but work has slowed down significantly recently due to my poor health and other commitments for the other authors.

I'm now in a situation where I'd like to work on it but don't have any funding. If anyone is interested in this being further developed and wants to contribute some development effort, or even some funding, then please get in touch!

Best wishes,

Robin

On 12 August 2018 at 14:14:08, Greg Wilson ([email protected]) wrote:

Hi,

Back in the Stone Age, Software Carpentry's lessons spent a few minutes discussing data provenance:

- Include the string '$Id:$' in every source code file. Subversion would automatically fill in the revision ID on every commit to turn it into something like '$Id: 12345'.
- Print the script's name, the commit ID, and the date in the header of every output file (along with all the parameters used by the script).

It wasn't much, and I don't know how many people ever actually implemented it, but it did allow you to keep track of which versions of which scripts had generated which output files in a systematic way.

So here we are today in what I hope is research computing's Bronze Age, and I'm curious: what do you all actually do to keep track of data provenance? What tools or methods do you use to record which programs produced which output files from which input files with which settings and parameters? I was excited about the Open Provenance effort circa 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch on. What are people using instead?

Thanks,

Greg

--
If you cannot be brave – and it is often hard to be brave – be kind.
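
For illustration, the header-stamping approach described in Greg's list (script name, commit ID, date and parameters at the top of every output file) could be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not anything taken from the original lessons: it assumes Git rather than Subversion, and the function names `current_commit_id` and `provenance_header`, as well as the '# key: value' header layout, are hypothetical.

```python
import datetime
import subprocess
import sys


def current_commit_id():
    """Return the current Git commit hash, or 'unknown' if it cannot be found.

    Assumption: the script lives in a Git working copy (not Subversion).
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or we are not inside a repository.
        return "unknown"


def provenance_header(script_name, params, commit_id=None, when=None):
    """Build comment lines recording script, commit, date and parameters.

    The '# key: value' layout is a hypothetical choice for this sketch.
    """
    commit_id = commit_id or current_commit_id()
    when = when or datetime.datetime.now(datetime.timezone.utc)
    lines = [
        f"# script: {script_name}",
        f"# commit: {commit_id}",
        f"# date: {when.isoformat()}",
    ]
    # Sort parameters so the header is identical across identical runs.
    lines += [f"# param {name}: {value}" for name, value in sorted(params.items())]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: print the header that would be written first to an output file.
    print(provenance_header(sys.argv[0], {"threshold": 0.5}), end="")
```

Writing the returned string as the first thing in each output file gives a plain-text record of which version of which script produced it, at the cost of every downstream reader having to skip the comment lines.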
------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M3cadb9b59850be00c755d516
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
