I'm a little curious why R hasn't had much success with cell-based, notebook-style interfaces. There have been a few good related efforts (https://github.com/ramnathv/rNotebook, https://github.com/swarm-lab/editR), and there are R kernels for IPython/jupyter. But it seems like most R users are sticking with the noweb-style Sweave/knitr approach. Recent versions of RStudio make knitr documents feel a little like a jupyter notebook, with small chunk-level controls for running chunks (cf. cells) and changing options, so perhaps there is a bit of convergence.
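For concreteness, a knitr chunk in an R Markdown file is just plain text with its options inline -- the chunk label, option values, and file name here are only illustrative:

    ```{r summary-stats, echo=TRUE, cache=FALSE}
    dat <- read.csv("results.csv")
    summary(dat)
    ```

RStudio's chunk-level run controls operate on blocks exactly like this one, which is part of why the two approaches are starting to feel similar.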

The JSON source files of jupyter don't suit me personally because I edit source files in more than one type of editor, and the simpler the text format, the easier it is to edit. But this is more of a preference than a critical issue for reproducibility. The big gains in reproducibility (in my field, at least) will come from getting people to write scripts and store them in plain text files. If all of my colleagues ditch Excel and SPSS for jupyter, I'll count that as progress and join in, even though it's not my preference.

I still struggle a bit with Docker on Windows, and sympathize with people who hesitate about Docker. But for managing dependencies outside of R for cross-platform use, it's currently the simplest and most efficient solution, and that alone is worth the effort. What makes me feel more comfortable about it is that when Docker fades away we'll still have our plain-text Dockerfiles, which we should be able to translate into whatever the next big virtualization thing is (as Tim just wrote). Here, the big gain in reproducibility is getting people to communicate the gory details of the software dependencies of their research pipelines. If we can normalize that behavior, and plain text files are involved, the specific software tool we do it with matters less.
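Even a minimal Dockerfile records those gory details in a plain-text form that some future tool could parse. A sketch (the rocker/r-base image is a community-maintained R image; the system library, package, and script names are placeholders, not a recommendation):

    FROM rocker/r-base
    # System-level libraries the analysis depends on
    RUN apt-get update && apt-get install -y libxml2-dev
    # R packages (versions are whatever CRAN serves at build time)
    RUN R -e "install.packages('dplyr', repos = 'https://cran.r-project.org')"
    # The analysis script itself
    COPY analysis.R /home/analysis.R
    CMD ["Rscript", "/home/analysis.R"]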

A general comment on reproducibility in science: for most researchers who are not bio/neuro/etc.-informaticians or computational x-ists, the software and services for highly reproducible research already work well, with several options depending on your preferences and needs. What's not 'here' yet is scientists valuing the abstract, tool-agnostic principles of reproducibility enough to implement them in their day-to-day work. Changing that seems to be quite a challenge!

Ben

On 14/1/2016 7:46 AM, Carl Boettiger wrote:
Damien,

Thanks for sharing that paper -- a really excellent read, and I generally
agree with all the points you make there.

A few questions for you:

I'm a bit confused about the role GitHub, or version control in general,
plays in this vision.  You seem to imply that hallmark examples of
reproducibility create a "pristine repo" that is "just for the
publication", but console the reader that using their "everyday repo for
the project" is "good enough".  Um, I would have thought that a
"pristine" repo somewhat defeats the purpose of version control in the
first place, and that the incentive for this model was largely driven by
individuals who either didn't have an "everyday GitHub repo" and were
merely posting their code on GitHub at the time of publication because
they had heard that was the right thing to do (this is becoming
increasingly common in my field), or who didn't want to expose the full
version history of mistakes and dead ends behind the final version.  For
the purposes of your paper I agree that either approach is fine, but I
see no reason to hold the "pristine repo" up as a gold standard.  (Sure,
curated commit histories are easier to read than a long string of merge
commits and the like, but that's beyond the scope here.)

More generally, I was glad to see that your 'author guidelines' didn't
actually mention version control at all.  For people already using it,
sure, it's an easy way to distribute code.  But otherwise, having people
simply post their scripts to a data archiving platform is probably at
least as adequate (better for persistence, if harder to browse, etc.).

I applaud your proposal for a high-level description and detailed
documentation of version dependencies, but I feel you gloss over the
details far too much here.  I think many researchers simply wouldn't
know what to put in these sections.  Knowing exactly which dependencies
to document and which not to isn't obvious to anyone, and I believe we
could benefit from more discussion of this topic as a field.  For
instance, many papers seem to consider it sufficient to say "analyses
were performed in R (cite R core team)", or "in R 3.1.2".  Clearly this
is too little, but where to draw the line isn't clear to me (do I need
compiler details? hardware details? we know they can matter).  Automated
tools help here (like sessionInfo() in R), but of course nothing is a
guarantee.  If such lists really were sufficient to reproduce
computational environments with adequate success and minimal effort,
tool-heavy creations like virtual machines and Docker would never have
taken off in the first place.  Figuring out what should be included
here, and why, is not trivial.  The situation is just as complicated for
purposes of credit as for purposes of reproducibility (e.g. if I use a
package that is a thin wrapper around some other software, what do I
cite?)
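For what it's worth, the kind of automated capture I have in mind can be
as simple as one line at the end of an analysis script -- a sketch, not
a proposal for a standard (the output file name is illustrative):

    # Record the R version, platform, and attached package versions
    # alongside the analysis outputs.
    writeLines(capture.output(sessionInfo()), "session_info.txt")

Though, as above, even a complete sessionInfo() listing says nothing
about compilers or hardware.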

The other thing I didn't quite understand was the discussion of log
files, mostly because I don't come from a community that has anything
quite like NetCDF (or the underlying HDF5).  Is the goal here to capture
workflow that takes place outside of a scripted environment (it sounds
like you are referring to a collection of command-line utilities rather
than code run from a python script)?  Or to capture the data used by the
scripts?  Or to capture the provenance of the data as it passes through
various analyses?  Perhaps your answer is "all of the above", but I
remain a bit confused about what these log files are and what their
analog would be outside the NetCDF context.  It sounds a little like
asking for the (time-stamped) log of an interactive bash/python/R
session rather than a script.  More generally, I'm curious what, if
anything, a minimal standard should ask for by way of provenance, or
whether the script plus a description of the computational environment
is sufficient.
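If that last reading is right, then in R terms it might look something
like the following -- purely a sketch (savehistory() works only in
interactive sessions, and the file name is illustrative):

    # In an interactive session: stamp the time into the command history,
    # work as usual, then dump the session's commands to a plain-text file.
    timestamp()
    # ... interactive analysis here ...
    savehistory("analysis.Rhistory")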

Anyway, thanks for your thoughts and apologies for the rambling message.

Cheers,
Carl

On Wed, Jan 13, 2016 at 11:26 AM Damien Irving
<[email protected]> wrote:

    Hi Jan,

    I have the same concerns regarding many of the "tool heavy"
    solutions to reproducibility out there. Here's an essay I wrote
    recently proposing a solution that requires no special tooling:
    http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-15-00010.1

    and here's the accompanying SWC lesson:
    http://damienirving.github.io/capstone-oceanography/

    (I should say that SWC doesn't teach Docker or anything like that -
    the shell, command-line programs, and automation-with-Make lessons
    teach people essentially everything they need to know to be
    reproducible. I wrote the capstone and the essay just to make that
    explicit, because I'm not sure SWC participants always realise that
    they've learned everything they need.)


    Cheers,
    Damien

    <https://github.com/DamienIrving/CV/blob/master/CV.md>



    On Thu, Jan 14, 2016 at 6:06 AM, Jan Kim <[email protected]> wrote:

        Dear All,

        I'm preparing a talk centred on reproducible computing, and as this
        is a topic relevant to and valued by SWC, I'd like to ask for your
        opinions and comments on it.

        One approach I take is checking to what extent my work from 10-20
        years ago is reproducible today, and (perhaps not surprisingly) I
        found that having used make, scripts and (relatively) well-defined
        text formats turns out to be highly beneficial in this regard.

        This has led me to wonder about some of the tools that currently
        seem to be popular, including on this list, but which to me appear
        unnecessarily fat / overloaded, and as such to have uncertain
        prospects for long-term reproducibility:

             * "notebook" systems, and iPython / jupyter in particular:
               - Will the JSON format for saving notebooks be readable /
                 executable in the long term?
               - Are these even reproducible in a rigorous sense,
        considering
                 that results can vary depending on the order of
        executing cells?

             * Virtual machines and the recent lightweight "containerising"
               systems (Docker, Conda): they're undoubtedly a blessing for
               reproducibility, but
               - what are the long-term prospects of executing their images /
                 environments etc.?
               - to what extent is their dependence on backing companies a
                 reason for concern?
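
             To illustrate the execution-order point, a sketch of the hazard
             (in R, since R kernels for jupyter exist; the values are made up):

                 # cell 1 -- originally `x <- 1`, later edited and re-run
                 x <- 2
                 # cell 2 -- last executed before cell 1 was edited
                 y <- x + 1   # the saved notebook shows y == 2, but a clean
                              # top-to-bottom re-run gives y == 3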

        I hope that comments on these are relevant / interesting to the SWC
        community, in addition to providing me with insights / inspiration,
        and that it is therefore ok to post this here.

        If you have comments on reproducible scientific computing in
        general, I'm interested in those as well -- please respond via the
        mailing list or by personal reply.

        Best regards & thanks in advance, Jan
        --
          +- Jan T. Kim -------------------------------------------------------+
          |             email: [email protected]                                |
          |             WWW: http://www.jtkim.dreamhosters.com/                 |
          *-----=<  hierarchical systems are for files, not for humans  >=-----*


--

http://carlboettiger.info


