This may or may not be off-topic. The ocean modeling folks I work with are picky about the word "data", which they reserve for measured observations from instruments in the water. My comments here are about the provenance of model run results; of course, in a more generally computational interpretation of the word "data", model results are data too.
One of the tools we built for running the NEMO ocean model in a reproducible fashion includes code that records information about the revisions of all of the repos involved in a particular model run. That information is stored in text files that travel with the model results. They are created when the run is being prepared for submission to the HPC queue, and are stored with the run results so that it takes an effort of will to separate them from the results files (not fool-proof, but generally effective). An example of the contents of such a file:

skookum:07aug18$ cat SS-run-sets_rev.txt
changeset:   1880:f7eb987b308b82b00043fdaf8def92b6e0f02ee6
tag:         tip
user:        Susan Allen <[email protected]>
date:        Mon Jun 18 16:10:06 2018 -07:00
files:       v201702/hindcast/namelist_smelt_cfg_highzoo v201702/hindcast/runfiles/13nov14hindcast_alpha.yaml
description: alpha test for hindcast with new file_def.xml
uncommitted changes:
  M v201702/nowcast-green/iodef.xml
skookum:07aug18$

The code that does this can be found in
https://bitbucket.org/salishsea/nemo-cmd/src/default/nemo_cmd/prepare.py#lines-1039
We use Mercurial, so that is the VCS it is implemented for, but something similar should be possible for Git and Subversion. A Git implementation is on my todo list, but not at a high enough priority to ever get any effort.

Repo revision files like that are one aspect of the collection of files about the model run that we preserve with the model results. Several other files that describe the model run and its inputs also accompany the run results files. The good news is that those kinds of files are text files, so they are small compared to model results files (no good reason to delete them) and easily human-readable. The goal is to have everything we need in a results directory to be able to reproduce the run at a later time, and we have good indications that we are succeeding.
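For anyone who wants to do the same thing with Git before my todo item gets any effort, here is a minimal sketch of the same idea: record the HEAD commit's hash, author, date, and description, plus any uncommitted changes, to a <repo>_rev.txt file in the results directory. The function name and file naming are just illustrative, not from nemo-cmd:

```python
import subprocess
from pathlib import Path

def record_git_revision(repo_path, results_dir):
    """Write a <repo>_rev.txt file describing the repo's current state.

    Records the commit hash, author, date, and subject of HEAD, plus any
    uncommitted changes, so the file can travel with the run results.
    """
    def git(*args):
        return subprocess.run(
            ["git", "-C", str(repo_path), *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    lines = [
        f"commit:      {git('rev-parse', 'HEAD')}",
        f"author:      {git('log', '-1', '--format=%an <%ae>')}",
        f"date:        {git('log', '-1', '--format=%ai')}",
        f"description: {git('log', '-1', '--format=%s')}",
    ]
    status = git("status", "--porcelain")
    if status:
        lines.append("uncommitted changes:")
        lines.extend(f"  {line}" for line in status.splitlines())
    rev_file = Path(results_dir) / f"{Path(repo_path).resolve().name}_rev.txt"
    rev_file.write_text("\n".join(lines) + "\n")
    return rev_file
```

Calling that once per repo at run-preparation time, with results_dir set to the run's results directory, reproduces the Mercurial behaviour closely enough for provenance purposes.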
Below is the file list from a 1-day model run:

skookum:07aug18$ ls -1
07aug18nowcast-green.yaml
BoundaryBay.nc
CampbellRiver.nc
CherryPoint.nc
domain_def.xml
field_def.xml
file_def.xml
FridayHarbor.nc
FVCOM_W.nc
grid_rev.txt
HalfmoonBay.nc
iodef.xml
layout.dat
namelist_cfg
namelist_ref
namelist_smelt_cfg
namelist_smelt_ref
namelist_top_cfg
namelist_top_ref
Nanaimo.nc
NeahBay.nc
NEMO-3.6-code_rev.txt
NEMO-Cmd_rev.txt
NEMO_Nowcast_rev.txt
NewWestminster.nc
ocean.output
output.namelist.dyn
output.namelist.sme
output.namelist.top
PatriciaBay.nc
PointAtkinson.nc
PortRenfrew.nc
rivers-climatology_rev.txt
SalishSea_03123360_restart.nc
SalishSea_03123360_restart_trc.nc
SalishSea_1d_20180807_20180807_dia2_T.nc
SalishSea_1d_20180807_20180807_grid_T.nc
SalishSea_1d_20180807_20180807_grid_U.nc
SalishSea_1d_20180807_20180807_grid_V.nc
SalishSea_1d_20180807_20180807_grid_W.nc
SalishSea_1d_20180807_20180807_ptrc_T.nc
SalishSea_1h_20180807_20180807_grid_T.nc
SalishSea_1h_20180807_20180807_grid_U.nc
SalishSea_1h_20180807_20180807_grid_V.nc
SalishSea_1h_20180807_20180807_grid_W.nc
SalishSea_1h_20180807_20180807_ptrc_T.nc
SalishSea_2h_20180807_20180807_dia1_T.nc
SalishSeaCmd_rev.txt
SalishSeaNEMO.sh
SandHeads.nc
SandyCove.nc
Slab_U.nc
Slab_V.nc
solver.stat
Squamish.nc
SS-run-sets_rev.txt
stderr
stdout
tides_rev.txt
time.step
tools_rev.txt
tracers_rev.txt
tracer.stat
VENUS_central_gridded.nc
VENUS_central.nc
VENUS_delta_gridded.nc
VENUS_east_gridded.nc
VENUS_east.nc
Victoria.nc
WoodwardsLanding.nc
XIOS-2_rev.txt
XIOS-ARCH_rev.txt
skookum:07aug18$

Doug

On Sun, Aug 12, 2018 at 8:26 AM Dustin Lang via discuss <[email protected]> wrote:

> Hi Greg,
>
> Down at LegacySurvey.org headquarters, where we are measuring a billion
> stars and galaxies (in ~100,000 "naturally parallel" chunks of sky), we
> record the version numbers of the major python packages in use in all of
> the output files.
> The top-level python script lives in a tagged live git repo, so we use
> 'git describe' for that one, and then most of the rest of the packages
> are provided by an idiosyncratic conda-based package management system at
> the supercomputing center we use (NERSC.gov). All in all, it's not
> pretty. But you asked what's happening out in the trenches! We also log
> all environment variables and command-line arguments to a text file per
> chunk-of-sky, and all the launch scripts live in our top-level git repo;
> I would prefer to capture more of that directly to the output files, but
> [boring reasons].
>
> Format-wise, our major output files are in FITS format, the ancient
> standard in astronomy, which has a key-value header attached to each
> image or table output. We also use a nice trick for checksumming: we use
> the write-to-memory mode of the fitsio library to write all outputs to
> memory, compute a sha256 sum on the in-memory data file, then write it to
> disk, and then at the end of the run, we write all the sha256 sums to
> disk. This allows us to detect silent i/o failures (POSIX-violating,
> not-supposed-to-happen, but we have seen it happen on our Lustre fs)
> where we write to disk and then checksum the corrupted data.
>
> Three outstanding problems in our setup:
> (1) we depend on a lot of input data sets (~4 distinct data sets, 1e5-ish
> files and many TB), and versioning those is something of a pain.
> (2) our code can checkpoint its state and resume later, but there is no
> guarantee that we are resuming with the same version of the code. (And
> indeed, often we want to be using a newer version of the code that fixes
> some bug, but we don't want to discard all the correct computation we
> have done up to that point.) Probably we should just re-log all the
> versions each time we resume.
> (3) we log the versions of the code we're *supposed* to be using, but
> it's possible (via PYTHONPATH, e.g.) that some other version of a package
> is getting imported; we should log full paths or package __version__
> strings or whatever. Similarly, we use 'git describe' to get the
> top-level code version, but (gasp) someone could be running uncommitted
> code in production (yeah, I know, it makes me feel sick too).
>
> cheers,
> --dustin
>
> On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id:$' in every source code file - Subversion
>> would automatically fill in the revision ID on every commit to turn it
>> into something like '$Id: 12345'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance? What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings
>> and parameters? I was excited about the Open Provenance effort circa
>> 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
>> on. What are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
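Dustin's hash-before-write checksumming trick is easy to reproduce generically. The sketch below is illustrative, not the LegacySurvey code: it uses a plain bytes buffer as a stand-in for fitsio's write-to-memory mode, hashes the bytes before they touch disk, and records all the digests at the end of the run so on-disk files can later be re-hashed and compared:

```python
import hashlib
import io
import json
from pathlib import Path

def write_with_checksum(path, payload_bytes, digests):
    """Write payload_bytes to path, recording a sha256 computed in memory.

    The hash is taken before the bytes touch disk, so a later re-hash of
    the on-disk file can expose silent i/o corruption.
    """
    buf = io.BytesIO()
    buf.write(payload_bytes)  # stand-in for fitsio's write-to-memory mode
    digest = hashlib.sha256(buf.getvalue()).hexdigest()
    Path(path).write_bytes(buf.getvalue())
    digests[str(path)] = digest
    return digest

def finalize_run(digests, sums_path):
    """At the end of the run, write all the sha256 sums to disk."""
    Path(sums_path).write_text(json.dumps(digests, indent=2))

def verify(path, digests):
    """Re-hash the on-disk file and compare with the in-memory digest."""
    on_disk = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return on_disk == digests[str(path)]
```

A verify() mismatch means the bytes on disk are not the bytes that were hashed in memory, which is exactly the write-then-checksum-the-corruption failure mode Dustin describes.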
>>
>> ------------------------------------------
>> The Carpentries: discuss
>> Permalink:
>> https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
>> Delivery options:
>> https://carpentries.topicbox.com/groups/discuss/subscription

------------------------------------------
The Carpentries: discuss
Permalink:
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-Me29af8649ff398218ef49e0e
Delivery options:
https://carpentries.topicbox.com/groups/discuss/subscription
