Hi Greg,

Down at LegacySurvey.org headquarters, where we are measuring a billion stars and galaxies (in ~100,000 "naturally parallel" chunks of sky), we record the version numbers of the major python packages in use in all of the output files. The top-level python script lives in a tagged live git repo, so we use 'git describe' for that one, and then most of the rest of the packages are provided by an idiosyncratic conda-based package management system at the supercomputing center we use (NERSC.gov). All in all, it's not pretty. But you asked what's happening out in the trenches! We also log all environment variables and command-line arguments to a text file per chunk-of-sky, and all the launch scripts live in our top-level git repo; I would prefer to capture more of that directly to the output files, but [boring reasons].
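
For anyone who wants to try something similar, here is a minimal sketch of that kind of per-run record, assuming a plain dict dumped as JSON is acceptable. The function names (git_describe, package_info, collect_provenance) and the layout of the record are hypothetical, made up for illustration; this is not the LegacySurvey code.

```python
import importlib
import importlib.metadata
import json
import os
import subprocess
import sys


def git_describe(repo_dir="."):
    """Return `git describe` output for repo_dir, or None if it cannot be run."""
    try:
        out = subprocess.run(
            # --always falls back to a short hash when there is no tag;
            # --dirty appends "-dirty" if the working tree has local changes.
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, not a repo, or no commits yet
    return out.stdout.strip()


def package_info(name):
    """Installed version and on-disk location of a package, as resolved at runtime.

    Assumes the distribution name and the import name are the same,
    which is not true for every package.
    """
    info = {"version": None, "path": None}
    try:
        info["version"] = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        pass
    try:
        module = importlib.import_module(name)
        info["path"] = getattr(module, "__file__", None)
    except ImportError:
        pass
    return info


def collect_provenance(packages, repo_dir="."):
    """Gather code version, package versions, environment and arguments."""
    return {
        "git_describe": git_describe(repo_dir),
        "python": sys.version,
        "argv": list(sys.argv),
        "environ": dict(os.environ),
        "packages": {name: package_info(name) for name in packages},
    }


if __name__ == "__main__":
    # The package list here is only an example.
    record = collect_provenance(["numpy", "json"])
    print(json.dumps(record, indent=2, sort_keys=True))
```

Recording the import path alongside the version is cheap, and it shows which copy of a package was actually picked up rather than which one was meant to be.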
Format-wise, our major output files are in FITS format, the ancient standard in astronomy, which has a key-value header attached to each image or table output. We also use a nice trick for checksumming: we use the write-to-memory mode of the fitsio library to write all outputs to memory, compute a sha256 sum on the in-memory data file, then write it to disk, and then at the end of the run, we write all the sha256 sums to disk. This allows us to detect silent i/o failures (posix-violating and not supposed to happen, but we have seen it happen on our lustre fs), which we would miss if we wrote to disk first and then checksummed the already-corrupted data.

Three outstanding problems in our setup:

(1) We depend on a lot of input data sets (~4 distinct data sets, 1e5ish files and many TB), and versioning those is something of a pain.

(2) Our code can checkpoint its state and resume later, but there is no guarantee that we are resuming with the same version of the code. (And indeed, often we want to be using a newer version of the code that fixes some bug, but we don't want to discard all the correct computation we have done up to that point.) Probably we should just re-log all the versions each time we resume.

(3) We log the versions of the code we're *supposed* to be using, but it's possible (via PYTHONPATH, e.g.) that some other version of a package is getting imported; we should log full paths or package __version__ strings or whatever. Similarly, we use 'git describe' to get the top-level code version, but (gasp) someone could be running uncommitted code in production (yeah, I know, it makes me feel sick too).

cheers,
--dustin

On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson wrote:
> Hi,
>
> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
> discussing data provenance:
>
> - Include the string '$Id:$' in every source code file - Subversion would
>   automatically fill in the revision ID on every commit to turn it into
>   something like '$Id: 12345'.
>
> - Print the script's name, the commit ID, and the date in the header of
>   every output file (along with all the parameters used by the script).
>
> It wasn't much, and I don't know how many people ever actually implemented
> it, but it did allow you to keep track of which versions of which scripts
> had generated which output files in a systematic way.
>
> So here we are today in what I hope is research computing's Bronze Age,
> and I'm curious: what do you all actually do to keep track of data
> provenance? What tools or methods do you use to record which programs
> produced which output files from which input files with which settings and
> parameters? I was excited about the Open Provenance effort circa 2006-07
> (https://openprovenance.org/opm/), but it never seemed to catch on. What
> are people using instead?
>
> Thanks,
>
> Greg
>
> --
> If you cannot be brave – and it is often hard to be brave – be kind.
>
> ------------------------------------------
> The Carpentries: discuss
> Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
> Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M57c776dfaac10eb5e9f5f48a
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
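
A footnote on the checksum-before-write trick described in the reply above: the idea can be sketched with nothing but the standard library. This version works on plain bytes rather than on fitsio's write-to-memory mode, so it is a stand-in for the technique and not the code the survey runs. The function names (write_with_checksum, verify, write_manifest) and the manifest layout are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def write_with_checksum(data: bytes, path: str) -> str:
    """Hash the in-memory bytes first, then write them to disk.

    The digest describes what we meant to write, so a later mismatch
    points at the filesystem rather than at the producer.
    """
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    return digest


def verify(path: str, expected: str) -> bool:
    """Re-read the file in 1 MiB blocks and compare against the recorded digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected


def write_manifest(digests: dict, path: str) -> None:
    """Write all digests at the end of the run, one "<hex>  <name>" line per file."""
    with open(path, "w") as f:
        for name, digest in sorted(digests.items()):
            f.write(f"{digest}  {name}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "chunk-0001.bin")  # hypothetical output name
        digests = {"chunk-0001.bin": write_with_checksum(b"example payload", out)}
        write_manifest(digests, os.path.join(tmp, "sha256sums.txt"))
        print(verify(out, digests["chunk-0001.bin"]))
```

The ordering is the whole point: if the hash were computed by reading the file back from disk, a corrupted write would produce a digest that matches the corrupted file and the failure would go unnoticed.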
