Hi all,

Greatly enjoying this discussion. A few comments on the current themes:
On Dockerfiles: Though I'm a dedicated Docker user (I publish Dockerfiles with my papers and have used Dockerized environments exclusively, both locally and remotely, for the past year and a half), I would push back on the idea that a Dockerfile is a complete recipe. I certainly believe writing a Dockerfile is a more practical solution than asking people to document dependencies manually -- if nothing else, it is easier to prove that it is the *actual* environment and not just what you *think* your environment is. However, a lot still depends on how you write your Dockerfile, and many of the issues are really just kicked upstream to things like the Debian release process. That's a good solution, since releases tend to keep most libraries stable (while usually offering backported security updates for a finite window), but it's not a magic bullet. And it's pretty easy & common to write your Dockerfile to just pull the latest version of packages off CRAN or pip install or whatnot, so the recipe builds the bleeding-edge software, not the versions you actually used (see the sketch at the bottom of this note). Who knows what longevity the binary Docker images have, but it must be acknowledged that it's the ugly, heavy binary images, not the nice Dockerfile snapshot, that provide the definitive environment.

General reproducibility: Very much agree with Titus that "what I need to run" feels like a more thorny problem for me than "what I'm running". I do wonder if that problem is particular to my own work and doesn't afflict the majority of users in my field, who might use a more 'vanilla' / off-the-shelf environment, or if we would find it to hold more generally if more researchers made "what I'm running" available in the first place. I'd love to have a more empirical picture of where reproducibility fails. Currently I conjecture that 99% of the research I encounter doesn't share scripts and probably isn't scripted to begin with, so we never even reach 'computational environment' issues. More SWC-style training on using and sharing scripts is probably the biggest win. After that, I suspect one gets pretty far doing what language-specific packages do for documentation: capturing dependencies at the level of an R DESCRIPTION file or a Python setup.py. That is, the hypothesis is that the next most common failure point is changes in high-level packages, rather than external C & Fortran libraries (think BLAS), compiler types & versions, differences in your kernel, or differences in your hardware or the cosmic rays passing through it (though we know all of those can matter!). The problems that Docker does vs does not address are perhaps a good study in which issues cause the most practical reproducibility problems, and which are special cases.

Damien, on logs & netcdf: Thanks for the clarifications -- very helpful, and we're very much on the same page. I agree with your picture that Makefiles, a Python script, a bash script, etc. are all operating in the same space here: they all play the dual role of a human-readable and executable workflow for the analysis. (Though personally I would roll your #4 item on 'logs' into the section on just providing code / scripts. I think SWC does a disservice, in both pedagogy and reproducibility, in teaching bash and bash scripting as something totally different from Python and Python scripting, but that's a discussion for another time.)
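The sketch referenced above -- a minimal illustration of the pinning point (the base image, package names, and version numbers are purely illustrative, not taken from any real project of mine):

    FROM debian:jessie
    RUN apt-get update && apt-get install -y python-pip

    # Unpinned: pulls whatever is newest on PyPI at build time, so rebuilding
    # the image later documents the bleeding edge, not the versions you ran.
    #RUN pip install numpy pandas

    # Pinned (versions illustrative): the recipe now records the environment
    # you actually used -- provided those releases stay available upstream,
    # which is exactly the "kicked upstream" caveat.
    RUN pip install numpy==1.10.4 pandas==0.17.1

Either way you're still trusting that whatever you pinned remains fetchable upstream, which is the Debian-release-process point again.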
On Wed, Jan 13, 2016 at 3:15 PM Tim Head <[email protected]> wrote:

> On Wed, Jan 13, 2016 at 11:46 PM Damien Irving <[email protected]> wrote:
>
>> @Tim - I'm interested in your comment regarding Dockerfiles: "A human can
>> read and re-produce it if we lost all the docker tools"
>>
>> Is the same true for any alternatives? For instance, if I put my
>> environment up on anaconda.org is there a dockerfile equivalent that
>> would allow it to be human read and re-produced even if conda disappeared?
>>
> Not sure. You can export your conda environment with:
>
> conda env export > environment.yml
>
> which produces a human readable file. I think, conda packages are "just"
> tarballs. So you could read the environment.yml, obtain the right conda
> packages, and then un-tar them. Something to try out.
>
> I think by default conda packages uploaded to binstar do not have the
> recipe for making them attached/linked. A bit like pushing a docker image
> to a registry without publishing the Dockerfile.
>
> T

--
http://carlboettiger.info
