Damien,

Thanks for sharing that paper: a really excellent read, and I generally agree with all your points.
A few questions for you.

I'm a bit confused about the role that GitHub, or version control in general, plays in this vision. You seem to imply that hallmark examples of reproducibility create a "pristine repo" that is "just for the publication", while consoling the reader that using their "everyday repo for the project" is "good enough". I would have thought that a "pristine" repo somewhat defeats the purpose of version control in the first place, and that the incentive for this model was largely driven by individuals who either didn't have an "everyday GitHub repo" and were merely posting their code on GitHub at the time of publication because they heard that was the right thing to do (this is becoming increasingly common in my field), or who didn't want to expose the whole version history of mistakes and dead ends leading up to the final version. For the purposes of your paper I agree that either approach is fine, but I see no reason to hold the "pristine repo" up as a gold standard. (Sure, curated commit histories are easier to read than a long string of merge commits and the like, but that's beyond the scope here.)

More generally, I was glad to see your author guidelines didn't actually mention version control at all. For people already using it, sure, it's an easy way to distribute code. But otherwise, having people simply post their scripts to a data-archiving platform is probably at least as adequate (better for persistence, if harder to browse).

I applaud your proposal for a high-level description and detailed documentation of version dependencies, but I feel you gloss over the details far too much here. I think many researchers simply wouldn't know what to put in these sections. Knowing exactly which dependencies to document, and which not to, isn't obvious to anyone, and I believe we could benefit from more discussion as a field on this topic. For instance, many papers seem to feel it is sufficient to say "analyses were performed in R (cite R Core Team)", or "in R 3.1.2". Clearly this is too little, but where to draw the line isn't clear to me (do I need compiler details? hardware details? we know they can matter). Automated tools help here, like sessionInfo() in R (see the sketch below), but of course nothing is a guarantee: if such lists were really sufficient to reproduce computational environments with adequate success and minimal effort, tool-heavy approaches such as virtual machines and Docker would never have taken off in the first place. Figuring out what should be included here, and why, is not trivial. And the situation is just as complicated for the purposes of credit as it is for reproducibility (e.g. if I use a package that is a thin wrapper around some other software, what do I cite?).
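To make the sessionInfo() point concrete, here is roughly the kind of automated capture I have in mind. This is a minimal sketch, not a prescription; the output file name is just illustrative:

    # At the end of an analysis script, record the computational
    # environment alongside the results. sessionInfo() reports the R
    # version, the platform/OS, and the versions of all loaded
    # packages; capture.output() turns its printed form into text.
    # ("environment.txt" is just an illustrative file name.)
    writeLines(capture.output(sessionInfo()), "environment.txt")

Even this stops short of compiler and hardware details, which is exactly the line-drawing problem I mean.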
The other thing I didn't quite understand was the discussion of log files, mostly because I don't come from a community that has anything quite like netCDF (or the underlying HDF5). Is the goal here to capture workflow that takes place outside of a scripted environment (it sounds like you are referring to a collection of command-line utilities rather than code run in a Python script)? Or to capture the data used by the scripts? Or to capture the provenance of the data as it passes through various analyses? Perhaps your answer is "all of the above", but I remain a bit confused about what these log files are and what their analog would be outside of the netCDF context. It sounds a little like asking for the (time-stamped) log files of an interactive bash/python/R session, rather than a script.
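If that reading is right, here is my guess at a plain-text analog outside the netCDF context. A hedged sketch in R: log_step() is a made-up helper, and the file names are hypothetical. (In your case, I gather the netCDF history attribute plays this role automatically.)

    # Hypothetical helper: append one time-stamped line to a running
    # provenance log each time a processing step transforms a file,
    # recording what was run, on what input, producing what output.
    log_step <- function(logfile, command, infile, outfile) {
      entry <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                     command, infile, "->", outfile)
      cat(entry, "\n", file = logfile, append = TRUE)
    }

    # e.g., after a (hypothetical) smoothing step:
    # log_step("provenance.log", "Rscript smooth.R --window 5",
    #          "raw_data.csv", "smoothed_data.csv")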
More generally, I'm curious what, if anything, a minimal standard should ask for by way of provenance, or whether the script plus a description of the computational environment is sufficient.

Anyway, thanks for your thoughts, and apologies for the rambling message.

Cheers,
Carl

On Wed, Jan 13, 2016 at 11:26 AM Damien Irving <[email protected]> wrote:

> Hi Jan,
>
> I have the same concerns regarding many of the "tool heavy" solutions to
> reproducibility out there. Here's an essay I wrote recently proposing a
> solution that requires no special tooling:
> http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-15-00010.1
>
> and here's the accompanying SWC lesson:
> http://damienirving.github.io/capstone-oceanography/
>
> (I should say that SWC doesn't teach Docker or anything like that - the
> shell, command line programs and automation with Make lessons basically
> teach people everything they need to know to be reproducible. I basically
> wrote the capstone and essay just to make that more explicit, because
> sometimes I'm not sure that SWC participants realise that they've learned
> everything they need).
>
> Cheers,
> Damien
>
> <https://github.com/DamienIrving/CV/blob/master/CV.md>
>
> On Thu, Jan 14, 2016 at 6:06 AM, Jan Kim <[email protected]> wrote:
>
>> Dear All,
>>
>> I'm preparing a talk centred on reproducible computing, and as this is
>> a topic relevant to and valued by SWC, I'd like to ask your opinion and
>> comments about it.
>>
>> One approach I take is checking to what extent my work from 10-20
>> years ago is reproducible today, and (perhaps not surprisingly) I found
>> that having used make, scripts and (relatively) well-defined text
>> formats turns out to be highly beneficial in this regard.
>>
>> This has led me to wonder about some of the tools that currently seem
>> to be popular, including on this list, but to me appear unnecessarily
>> fat / overloaded and as such to have an uncertain outlook for long-term
>> reproducibility:
>>
>> * "notebook" systems, and IPython / Jupyter in particular:
>>   - Will the JSON format for saving notebooks be readable /
>>     executable in the long term?
>>   - Are these even reproducible in a rigorous sense, considering
>>     that results can vary depending on the order of executing cells?
>>
>> * Virtual machines and the recent lightweight "containerising"
>>   systems (Docker, Conda): they're undoubtedly a blessing for
>>   reproducibility, but
>>   - what are the long-term prospects of executing their images /
>>     environments etc.?
>>   - to what extent is their dependence on backing companies a
>>     reason for concern?
>>
>> I hope that comments on these are relevant / interesting to the SWC
>> community, in addition to providing me with insights / inspiration,
>> and that posting this here is therefore ok.
>>
>> If you have comments on reproducible scientific computing in general,
>> I'm interested as well -- please respond by mailing list or personal
>> reply.
>>
>> Best regards & thanks in advance,
>> Jan
>> --
>> +- Jan T. Kim ------------------------------------------------------+
>> | email: [email protected]                                    |
>> | WWW:   http://www.jtkim.dreamhosters.com/                          |
>> *-----=< hierarchical systems are for files, not for humans >=-----*

--
http://carlboettiger.info
