This may or may not be off-topic. The ocean modeling folks I work with are picky about the word "data", which they reserve for measured observations from instruments in the water. My comments here are about the provenance of model run results; of course, in a more generally computational interpretation of the word "data", model results are data too.
One of the tools we built for running the NEMO ocean model in a reproducible fashion includes code that records information about the revisions of all of the repos involved in a particular model run. That information is stored in text files that travel with the model results. They are created when the run is being prepared for submission to the HPC queue, and are stored with the run results so that it takes an effort of will to separate them from the results files (not fool-proof, but generally effective). An example of the contents of such a file:

skookum:07aug18$ cat SS-run-sets_rev.txt
changeset:   1880:f7eb987b308b82b00043fdaf8def92b6e0f02ee6
tag:         tip
user:        Susan Allen <[email protected]>
date:        Mon Jun 18 16:10:06 2018 -07:00
files:       v201702/hindcast/namelist_smelt_cfg_highzoo v201702/hindcast/runfiles/13nov14hindcast_alpha.yaml
description: alpha test for hindcast with new file_def.xml
uncommitted changes:
  M v201702/nowcast-green/iodef.xml
skookum:07aug18$

The code that does this can be found in
https://bitbucket.org/salishsea/nemo-cmd/src/default/nemo_cmd/prepare.py#lines-1039
We use Mercurial, so that is the VCS it is implemented for, but something similar should be possible for Git and Subversion. A Git implementation is on my todo list, but not at a high enough priority to ever get any effort.

Repo revision files like that are one aspect of the collection of files about the model run that we preserve with the model results. Several other files that describe the model run and its inputs also accompany the run results files. The good news is that those kinds of files are text files, so they are small compared to model results files (no good reason to delete them) and easily human-readable. The goal is to have everything we need in a results directory to be able to reproduce the run at a later time, and we have good indications that we are succeeding.
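For anyone who wants to do the same thing with Git before my todo item gets any effort, here is a minimal sketch of the same idea: record the HEAD commit's hash, author, date, and description, plus any uncommitted changes, to a <repo>_rev.txt file in the results directory. The function name and file naming are just illustrative, not from nemo-cmd:

```python
import subprocess
from pathlib import Path

def record_git_revision(repo_path, results_dir):
    """Write a <repo>_rev.txt file describing the repo's current state.

    Records the commit hash, author, date, and subject of HEAD, plus any
    uncommitted changes, so the file can travel with the run results.
    """
    def git(*args):
        return subprocess.run(
            ["git", "-C", str(repo_path), *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    lines = [
        f"commit:      {git('rev-parse', 'HEAD')}",
        f"author:      {git('log', '-1', '--format=%an <%ae>')}",
        f"date:        {git('log', '-1', '--format=%ai')}",
        f"description: {git('log', '-1', '--format=%s')}",
    ]
    status = git("status", "--porcelain")
    if status:
        lines.append("uncommitted changes:")
        lines.extend(f"  {line}" for line in status.splitlines())
    rev_file = Path(results_dir) / f"{Path(repo_path).resolve().name}_rev.txt"
    rev_file.write_text("\n".join(lines) + "\n")
    return rev_file
```

Calling that once per repo at run-preparation time, with results_dir set to the run's results directory, reproduces the Mercurial behaviour closely enough for provenance purposes.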
Below is the file list from a 1-day model run:

skookum:07aug18$ ls -1
07aug18nowcast-green.yaml
BoundaryBay.nc
CampbellRiver.nc
CherryPoint.nc
domain_def.xml
field_def.xml
file_def.xml
FridayHarbor.nc
FVCOM_W.nc
grid_rev.txt
HalfmoonBay.nc
iodef.xml
layout.dat
namelist_cfg
namelist_ref
namelist_smelt_cfg
namelist_smelt_ref
namelist_top_cfg
namelist_top_ref
Nanaimo.nc
NeahBay.nc
NEMO-3.6-code_rev.txt
NEMO-Cmd_rev.txt
NEMO_Nowcast_rev.txt
NewWestminster.nc
ocean.output
output.namelist.dyn
output.namelist.sme
output.namelist.top
PatriciaBay.nc
PointAtkinson.nc
PortRenfrew.nc
rivers-climatology_rev.txt
SalishSea_03123360_restart.nc
SalishSea_03123360_restart_trc.nc
SalishSea_1d_20180807_20180807_dia2_T.nc
SalishSea_1d_20180807_20180807_grid_T.nc
SalishSea_1d_20180807_20180807_grid_U.nc
SalishSea_1d_20180807_20180807_grid_V.nc
SalishSea_1d_20180807_20180807_grid_W.nc
SalishSea_1d_20180807_20180807_ptrc_T.nc
SalishSea_1h_20180807_20180807_grid_T.nc
SalishSea_1h_20180807_20180807_grid_U.nc
SalishSea_1h_20180807_20180807_grid_V.nc
SalishSea_1h_20180807_20180807_grid_W.nc
SalishSea_1h_20180807_20180807_ptrc_T.nc
SalishSea_2h_20180807_20180807_dia1_T.nc
SalishSeaCmd_rev.txt
SalishSeaNEMO.sh
SandHeads.nc
SandyCove.nc
Slab_U.nc
Slab_V.nc
solver.stat
Squamish.nc
SS-run-sets_rev.txt
stderr
stdout
tides_rev.txt
time.step
tools_rev.txt
tracers_rev.txt
tracer.stat
VENUS_central_gridded.nc
VENUS_central.nc
VENUS_delta_gridded.nc
VENUS_east_gridded.nc
VENUS_east.nc
Victoria.nc
WoodwardsLanding.nc
XIOS-2_rev.txt
XIOS-ARCH_rev.txt
skookum:07aug18$

Doug

On Sun, Aug 12, 2018 at 8:26 AM Dustin Lang via discuss <[email protected]> wrote:

> Hi Greg,
>
> Down at LegacySurvey.org headquarters, where we are measuring a billion
> stars and galaxies (in ~100,000 "naturally parallel" chunks of sky), we
> record the version numbers of the major python packages in use in all of
> the output files.
> The top-level python script lives in a tagged live git repo, so we use
> 'git describe' for that one, and then most of the rest of the packages
> are provided by an idiosyncratic conda-based package management system at
> the supercomputing center we use (NERSC.gov). All in all, it's not
> pretty. But you asked what's happening out in the trenches! We also log
> all environment variables and command-line arguments to a text file per
> chunk-of-sky, and all the launch scripts live in our top-level git repo;
> I would prefer to capture more of that directly to the output files, but
> [boring reasons].
>
> Format-wise, our major output files are in FITS format, the ancient
> standard in astronomy, which has a key-value header attached to each
> image or table output. We also use a nice trick for checksumming: we use
> the write-to-memory mode of the fitsio library to write all outputs to
> memory, compute a sha256 sum on the in-memory data file, then write it to
> disk, and then at the end of the run, we write all the sha256 sums to
> disk. This allows us to detect silent i/o failures (POSIX-violating,
> not-supposed-to-happen, but we have seen it happen on our Lustre fs)
> where we write to disk and then checksum the corrupted data.
>
> Three outstanding problems in our setup:
> (1) we depend on a lot of input data sets (~4 distinct data sets, 1e5-ish
> files and many TB), and versioning those is something of a pain.
> (2) our code can checkpoint its state and resume later, but there is no
> guarantee that we are resuming with the same version of the code. (And
> indeed, often we want to be using a newer version of the code that fixes
> some bug, but we don't want to discard all the correct computation we
> have done up to that point.) Probably we should just re-log all the
> versions each time we resume.
> (3) we log the versions of the code we're *supposed* to be using, but
> it's possible (via PYTHONPATH, e.g.) that some other version of a package
> is getting imported; we should log full paths or package __version__
> strings or whatever. Similarly, we use 'git describe' to get the
> top-level code version, but (gasp) someone could be running uncommitted
> code in production (yeah, I know, it makes me feel sick too).
>
> cheers,
> --dustin
>
> On Sun, Aug 12, 2018 at 9:13 AM, Greg Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id:$' in every source code file - Subversion
>> would automatically fill in the revision ID on every commit to turn it
>> into something like '$Id: 12345'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance? What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings
>> and parameters? I was excited about the Open Provenance effort circa
>> 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
>> on. What are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
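Dustin's hash-before-write checksumming trick is easy to reproduce generically. The sketch below is illustrative, not the LegacySurvey code: it uses a plain bytes buffer as a stand-in for fitsio's write-to-memory mode, hashes the bytes before they touch disk, and records all the digests at the end of the run so on-disk files can later be re-hashed and compared:

```python
import hashlib
import io
import json
from pathlib import Path

def write_with_checksum(path, payload_bytes, digests):
    """Write payload_bytes to path, recording a sha256 computed in memory.

    The hash is taken before the bytes touch disk, so a later re-hash of
    the on-disk file can expose silent i/o corruption.
    """
    buf = io.BytesIO()
    buf.write(payload_bytes)  # stand-in for fitsio's write-to-memory mode
    digest = hashlib.sha256(buf.getvalue()).hexdigest()
    Path(path).write_bytes(buf.getvalue())
    digests[str(path)] = digest
    return digest

def finalize_run(digests, sums_path):
    """At the end of the run, write all the sha256 sums to disk."""
    Path(sums_path).write_text(json.dumps(digests, indent=2))

def verify(path, digests):
    """Re-hash the on-disk file and compare with the in-memory digest."""
    on_disk = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return on_disk == digests[str(path)]
```

A verify() mismatch means the bytes on disk are not the bytes that were hashed in memory, which is exactly the write-then-checksum-the-corruption failure mode Dustin describes.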
>>
>> ------------------------------------------
>> The Carpentries: discuss
>> Permalink:
>> https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
>> Delivery options:
>> https://carpentries.topicbox.com/groups/discuss/subscription

------------------------------------------
The Carpentries: discuss
Permalink:
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-Me29af8649ff398218ef49e0e
Delivery options:
https://carpentries.topicbox.com/groups/discuss/subscription
