Re: [discuss] tracking data provenance

Michael Zingale Sun, 12 Aug 2018 07:49:25 -0700

We generate a source file at build time containing the metadata for our
build environment for our hydro codes, https://github.com/AMReX-Astro/
This captures the git hashes, compiler versions and flags, build date,
location, etc., and stores it in the output (an example of some of the
information is appended below).  Runtime parameters are also included.
What we still need to do is transfer this info to the image/PDF metadata
for plots generated from our output.
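
For the image-metadata step, here is a minimal sketch of one possible
approach for PNG files, using only the Python standard library.  It is not
something our codes do today; the function names (make_chunk,
add_text_metadata, read_text_metadata) are made up for illustration.  It
relies on the PNG format allowing tEXt chunks (Latin-1 keyword, NUL byte,
Latin-1 text) anywhere before the final IEND chunk.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, then a CRC over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_metadata(png_bytes, metadata):
    """Return a copy of png_bytes with one tEXt chunk per metadata item.

    The chunks are inserted just before the closing IEND chunk, which is
    always the last 12 bytes of a well-formed PNG file.
    """
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    if png_bytes[-8:-4] != b"IEND":
        raise ValueError("PNG does not end with an IEND chunk")
    body, iend = png_bytes[:-12], png_bytes[-12:]
    chunks = b""
    for key, value in metadata.items():
        data = key.encode("latin-1") + b"\x00" + value.encode("latin-1")
        chunks += make_chunk(b"tEXt", data)
    return body + chunks + iend


def read_text_metadata(png_bytes):
    """Walk the chunk list and collect every tEXt keyword/value pair."""
    found = {}
    pos = len(PNG_SIGNATURE)
    while pos < len(png_bytes):
        (length,) = struct.unpack(">I", png_bytes[pos:pos + 4])
        chunk_type = png_bytes[pos + 4:pos + 8]
        data = png_bytes[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return found
```

Usage would be to read the plot file as bytes, call add_text_metadata with
a dict such as {"Castro git describe": "18.08-20-g1574ce2b-dirty"}, and
write the result back out.  Plotting libraries may offer a more direct
route (matplotlib's savefig accepts a metadata argument, for example).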


The script that is invoked by make is here:

https://github.com/AMReX-Codes/amrex/blob/master/Tools/C_scripts/makebuildinfo_C.py
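
To give a flavor of the idea without reading the whole script, here is a
much-reduced sketch of the same technique.  It is not the AMReX script:
the function names, the field names, and the shape of the generated C++
(a build_info array of key/value strings) are all assumptions made for
illustration.

```python
import datetime
import platform
import subprocess


def git_describe(repo_dir):
    """Return `git describe --always --dirty` for repo_dir, or "unknown"."""
    try:
        result = subprocess.run(
            ["git", "describe", "--always", "--dirty"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, directory missing, or not a git repository
        return "unknown"
    return result.stdout.strip()


def collect_build_info(repo_dir, compiler, flags):
    """Gather the build metadata into an ordered dict of strings."""
    return {
        "build date": datetime.datetime.now().isoformat(sep=" "),
        "build machine": platform.platform(),
        "build dir": repo_dir,
        "compiler": compiler,
        "flags": flags,
        "git describe": git_describe(repo_dir),
    }


def c_string(value):
    """Quote a Python string as a C/C++ string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def render_source(info):
    """Render the metadata as a C++ source file (hypothetical layout)."""
    lines = ["// Generated at build time; do not edit by hand."]
    lines.append("const char* build_info[][2] = {")
    for key, value in info.items():
        lines.append("    {%s, %s}," % (c_string(key), c_string(str(value))))
    lines.append("};")
    lines.append("int build_info_count = %d;" % len(info))
    return "\n".join(lines) + "\n"
```

A make rule would run something like
render_source(collect_build_info(".", "mpicxx", "-g -O3")) before
compiling, write the result to a .cpp file, and link it in, so the
executable can print the table into every output it produces.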


 Castro Job Information
job name:

inputs file: inputs.3d.sph

number of MPI processes: 1

CPU time used since start of simulation (CPU-hours): 5.89267e-05

 Plotfile Information
output data / time: Sat Aug 11 20:14:49 2018
output dir:         /home/zingale/development/Castro/Exec/hydro_tests/Sedov


 Build Information
build date:    2018-08-11 20:14:16.905047
build machine: Linux localhost.localdomain 4.17.12-200.fc28.x86_64 #1 SMP
               Fri Aug 3 15:01:13 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
build dir:     /home/zingale/development/Castro/Exec/hydro_tests/Sedov
AMReX dir:     /home/zingale/development/AMReX/

COMP:          gnu
COMP version:  8.1.1

C++ compiler:  mpicxx
C++ flags:     -g -O3 -DNDEBUG -DBL_USE_MPI -DAMREX_USE_MPI
               -DAMREX_GIT_VERSION="18.07-127-g5ae9071ec99c"
               -DBL_GCC_VERSION=8.1.1 -DBL_GCC_MAJOR_VERSION=8
               -DBL_GCC_MINOR_VERSION=1 -DAMREX_LAUNCH= -DAMREX_DEVICE=
               -DBL_SPACEDIM=3 -DAMREX_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE
               -DAMREX_FORT_USE_UNDERSCORE -DBL_Linux -DAMREX_Linux
               -DCRSEGRNDOMP -DSPONGE -I.
               -I/home/zingale/development/AMReX//Src/Base
               -I/home/zingale/development/AMReX//Src/AmrCore
               -I/home/zingale/development/AMReX//Src/Amr
               -I/home/zingale/development/AMReX//Src/Boundary
               -I/home/zingale/development/Microphysics/util -I.
               -I../../../Source/driver
               -I../../../Source/driver/param_includes
               -I../../../Source/hydro -I../../../Source/problems
               -I../../../Source/sources -I../../../constants
               -I../../../Util/model_parser -I../../../Microphysics/EOS
               -I../../../Microphysics/EOS/gamma_law
               -I../../../Microphysics/networks
               -I../../../Microphysics/networks/general_null
               -I/home/zingale/development/Microphysics/EOS
               -I/home/zingale/development/Microphysics/networks
               -I/home/zingale/development/AMReX//Tools/C_scripts

Fortran comp:  mpif90
Fortran flags:  -g -O3 -ffree-line-length-none -fno-range-check
                -fno-second-underscore -Jtmp_build_dir/o/3d.gnu.MPI.EXE
                -I tmp_build_dir/o/3d.gnu.MPI.EXE -fimplicit-none

Link flags:    -L. -L/usr/lib/gcc/x86_64-redhat-linux/8/
Libraries:     -m64 -O2 -fPIC -Wl,-z,noexecstack
               -I/usr/include/mpich-x86_64
               -I/usr/lib64/gfortran/modules/mpich -L/usr/lib64/mpich/lib
               -lmpifort -Wl,-rpath -Wl,/usr/lib64/mpich/lib
               -Wl,--enable-new-dtags -lmpi -lgfortran -lquadmath

EOS: ../../../Microphysics/EOS/gamma_law
NETWORK: ../../../Microphysics/networks/general_null

Castro       git describe: 18.08-20-g1574ce2b-dirty
AMReX        git describe: 18.07-127-g5ae9071ec
Microphysics git describe: 18.08





On Sun, Aug 12, 2018 at 10:37 AM Greg Wilson <[email protected]> wrote:

> Hi Naupaka; thanks for your mail.  I played with Sumatra a couple of times
> as well, but it didn't stick - what I'm chasing now are things people are
> actually using in small- to medium-sized projects.  (The way CERN and STScI
> handle metadata is cool, and I'm grateful for it, but it doesn't scale down
> to what most of us do in the lab.)  The sessionInfo() trick is cool - what
> else are people using?
>
> Thanks,
>
> Greg
>
> On 2018-08-12 9:55 AM, naupaka via discuss wrote:
>
> I remember playing with Sumatra several years ago. I believe the approach
> is to track all that metadata in a SQLite db and then make it
> browsable/accessible with a Django web app.
>
> http://neuralensemble.org/sumatra/
>
> In the R world many folks have taken to appending `sessionInfo()` or
> `devtools::session_info()` to the end of an Rmd file to track packages
> attached, etc. The latter also gives SHAs for packages installed from
> GitHub. Wouldn’t be that hard to also start including a shell chunk with
> `git rev-parse HEAD` to include the local repo commit info.
>
> Here’s the old discussion on this I remember from several years ago:
> https://github.com/swcarpentry/DEPRECATED-site/issues/1085
>
> Best,
> Naupaka
>
> On Aug 12, 2018, at 6:30 AM, Bruce Becker via discuss <
> [email protected]> wrote:
>
> Hi Greg, all
> I'm not sure about the Bronze Age, but in the Baroque era my understanding
> is that this is the job of metadata. You need a lot of machinery to do
> this, but in this era, data never lives "nakedly"; it is always
> accompanied by metadata which describes it. So, you look up data by its
> persistent identifier, in repositories, and deposit it, along with its
> changelog or whatever, in repositories.
>
> I am the first to concede that many, if not the vast majority, of data
> civilisations will never reach the Baroque age - and perhaps others will
> skip it altogether, but this happens to be the civilisation I'm writing to
> you from. I'd hazard the suggestion that the Baroque Age is also known as
> the Open Science age, just to be prickly.
>
> Have a great Sunday!
> Bruce
>
> On Sun, 12 Aug 2018 at 15:15, Greg Wilson <[email protected]> wrote:
>
>> Hi,
>>
>> Back in the Stone Age, Software Carpentry's lessons spent a few minutes
>> discussing data provenance:
>>
>> - Include the string '$Id:$' in every source code file - Subversion
>> would automatically fill in the revision ID on every commit to turn it
>> into something like '$Id: 12345'.
>>
>> - Print the script's name, the commit ID, and the date in the header of
>> every output file (along with all the parameters used by the script).
>>
>> It wasn't much, and I don't know how many people ever actually
>> implemented it, but it did allow you to keep track of which versions of
>> which scripts had generated which output files in a systematic way.
>>
>> So here we are today in what I hope is research computing's Bronze Age,
>> and I'm curious: what do you all actually do to keep track of data
>> provenance?  What tools or methods do you use to record which programs
>> produced which output files from which input files with which settings
>> and parameters?  I was excited about the Open Provenance effort circa
>> 2006-07 (https://openprovenance.org/opm/), but it never seemed to catch
>> on.  What are people using instead?
>>
>> Thanks,
>>
>> Greg
>>
>> --
>> If you cannot be brave – and it is often hard to be brave – be kind.
>>
>>
>> ------------------------------------------
>> The Carpentries: discuss
>> Permalink:
>> https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M703907d77763bffcdf143f1c
>> Delivery options:
>> https://carpentries.topicbox.com/groups/discuss/subscription
>>
>


-- 
Michael Zingale
Associate Professor

Dept. of Physics & Astronomy • Stony Brook University • Stony Brook, NY
11794-3800
*phone*:  631-632-8225
*e-mail*: [email protected]
*web*: http://www.astro.sunysb.edu/mzingale
github: http://github.com/zingale

------------------------------------------
The Carpentries: discuss
Permalink: 
https://carpentries.topicbox.com/groups/discuss/Te1cade367c0ab4ee-M5fc4d12ec26b0bade2755871
Delivery options: https://carpentries.topicbox.com/groups/discuss/subscription
