Re: [discuss] tracking data provenance

Dustin Lang via discuss Sun, 12 Aug 2018 08:26:49 -0700

Hi Greg,

Down at LegacySurvey.org headquarters, where we are measuring a billion
stars and galaxies (in ~100,000 "naturally parallel" chunks of sky), we
record the version numbers of the major python packages in use in all of
the output files.  The top-level python script lives in a tagged live git
repo, so we use 'git describe' for that one, and then most of the rest of
the packages are provided by an idiosyncratic conda-based package
management system at the supercomputing center we use (NERSC.gov).  All in
all, it's not pretty.  But you asked what's happening out in the trenches!
We also log all environment variables and command-line arguments to a text
file per chunk-of-sky, and all the launch scripts live in our top-level git
repo; I would prefer to capture more of that directly to the output files,
but [boring reasons].
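
For readers who want something concrete, here is a minimal sketch of the kind of per-chunk logging described above. It is not the LegacySurvey code: the function names, the JSON layout, and the choice of `importlib.metadata` for package versions are all assumptions made for illustration.

```python
import json
import os
import subprocess
import sys
from importlib import metadata


def git_describe(repo_dir="."):
    """Return the output of `git describe` for repo_dir, or None if it fails
    (git missing, not a repository, or no tag to describe from)."""
    try:
        out = subprocess.run(
            ["git", "describe"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def provenance_record(packages, repo_dir="."):
    """Collect code version, package versions, command line and environment."""
    return {
        "code_version": git_describe(repo_dir),
        "packages": package_versions(packages),
        "argv": list(sys.argv),
        "environ": dict(os.environ),
    }


def write_provenance_log(path, packages, repo_dir="."):
    """Write one provenance record (hypothetical layout) as a JSON text file."""
    record = provenance_record(packages, repo_dir)
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record
```

One such file per chunk of sky, e.g. `write_provenance_log("chunk-0001.provenance.json", ["numpy", "fitsio"])` (the file name is made up), would cover the items listed above; the same record could also be copied into the key-value header of an output file.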

Format-wise, our major output files are in FITS format, the ancient
standard in astronomy, which has a key-value header attached to each image
or table output.  We also use a nice trick for checksumming: we use the
write-to-memory mode of the fitsio library to write all outputs to memory,
compute a sha256 sum on the in-memory data file, then write it to disk, and
then at the end of the run, we write all the sha256 sums to disk.  This
allows us to detect silent i/o failures (posix-violating
not-supposed-to-happen, but we have seen it happen on our lustre fs).  If we
instead wrote to disk first and checksummed the file afterwards, we would
just be checksumming the corrupted data.
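
The ordering matters here, so a small sketch may help. It assumes the output has already been serialized to a `bytes` object in memory (in the setup above that is the job of fitsio's write-to-memory mode, which is not reproduced here); the function names and the layout of the checksum file are hypothetical.

```python
import hashlib
import os


def write_with_checksum(data, path, checksums):
    """Hash the in-memory bytes first, then write them to disk.

    The digest describes what we intended to write, so a later mismatch
    points at the file system rather than at our own code.
    """
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    checksums[path] = digest
    return digest


def write_checksum_file(checksums, path):
    """At the end of the run, write one 'digest  filename' line per output."""
    with open(path, "w") as f:
        for name, digest in sorted(checksums.items()):
            f.write(f"{digest}  {os.path.basename(name)}\n")


def verify(path, expected):
    """Re-read a file from disk and compare its sha256 with the recorded one."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected
```

A later pass over the outputs can call `verify()` on each file; any file that fails was damaged after the digest was computed, i.e. somewhere between memory and disk.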

Three outstanding problems in our setup:
(1) we depend on a lot of input data sets (~4 distinct data sets, 1e5ish
files and many TB), and versioning those is something of a pain.
(2) our code can checkpoint its state and resume later, but there is no
guarantee that we are resuming with the same version of the code.  (And
indeed, often we want to be using a newer version of the code that fixes
some bug, but we don't want to discard all the correct computation we have
done up to that point.)  Probably we should just re-log all the versions
each time we resume.
(3) we log the versions of the code we're *supposed* to be using, but it's
possible (via PYTHONPATH, e.g.) that some other version of a package is
getting imported; we should log full paths or package __version__ strings
or whatever.  Similarly, we use 'git describe' to get the top-level code
version, but (gasp) someone could be running uncommitted code in production
(yeah, I know, it makes me feel sick too).
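
For problem (3), one possible approach is to record what was actually imported rather than what was requested, and to let git flag uncommitted changes. The sketch below is an assumption-laden illustration, not existing project code: the helper names are invented, and not every package defines `__version__`.

```python
import subprocess
import sys


def loaded_module_info(names):
    """For each module name, report the file it was really imported from and
    its __version__ string, if any. Modules not yet imported map to None."""
    info = {}
    for name in names:
        mod = sys.modules.get(name)
        if mod is None:
            info[name] = None
            continue
        info[name] = {
            "file": getattr(mod, "__file__", None),
            "version": getattr(mod, "__version__", None),
        }
    return info


def git_describe_dirty(repo_dir="."):
    """Run `git describe --always --dirty`; git appends '-dirty' when the
    working tree has uncommitted changes. Returns None if git fails."""
    try:
        out = subprocess.run(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def is_dirty(description):
    """True if a describe string carries git's '-dirty' suffix."""
    return description is not None and description.endswith("-dirty")
```

Calling both helpers again every time a run resumes from a checkpoint would also leave a record for problem (2), since each resume would log whichever versions were in use at that point.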

cheers,
--dustin

On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson <[email protected]> wrote:

> Hi,
>
> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
> discussing data provenance:
>
> - Include the string '$Id:$' in every source code file - Subversion would
> automatically fill in the revision ID on every commit to turn it into
> something like '$Id: 12345'.
>
> - Print the script's name, the commit ID, and the date in the header of
> every output file (along with all the parameters used by the script).
>
> It wasn't much, and I don't know how many people ever actually implemented
> it, but it did allow you to keep track of which versions of which scripts
> had generated which output files in a systematic way.
>
> So here we are today in what I hope is research computing's Bronze Age,
> and I'm curious: what do you all actually do to keep track of data
> provenance?  What tools or methods do you use to record which programs
> produced which output files from which input files with which settings and
> parameters?  I was excited about the Open Provenance effort circa 2006-07 (
> https://openprovenance.org/opm/), but it never seemed to catch on.  What
> are people using instead?
>
> Thanks,
>
> Greg
>
> --
> If you cannot be brave – and it is often hard to be brave – be kind.
>
>
> ------------------------------------------
> The Carpentries: discuss
> Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
> Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
>

------------------------------------------
The Carpentries: discuss
Permalink: https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M57c776dfaac10eb5e9f5f48a
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
