Hi Carl,

Thanks for the comments! I've replied inline...
On Thu, Jan 14, 2016 at 7:46 AM, Carl Boettiger <[email protected]> wrote:

> Damien,
>
> Thanks for sharing that paper, really excellent read and I generally agree with all your points there.
>
> A few questions for you:
>
> I'm a bit confused about the role GitHub, or version control in general, plays in this vision. You seem to imply that hallmark examples of reproducibility create a "pristine repo" that is "just for the publication", but console the reader that using their "everyday repo for the project" is "good enough". Um, I would have thought that a "pristine" repo somewhat defeated the purpose of version control in the first place, and the incentive for this model was largely driven by individuals who either didn't have an "every day github repo", but were merely posting their code on GitHub at the time of publication because they heard that was the right thing to do (this is becoming increasingly common in my field), or because they didn't want to expose the whole version history of the mistakes and dead-ends before the final version. For the purposes of your paper I agree that either approach is fine, but I see no reason to hold the "pristine repo" up as a gold standard. (Sure, curated commit histories are easier to read than a whole lot of [merge] and stuff, but that's beyond the scope here.)

I wasn't holding up the "pristine repo" as the gold standard. I was pointing out that while many reproducible papers create a pristine repo, I think it goes against the workflow that version control promotes, and people should not feel that they have to create a brand-new repo for each paper they write (i.e. I think we agree on this point).

> More generally, I was glad to see your 'author guidelines' didn't actually mention version control at all. For people already using it, sure, it's an easy way to distribute code. But otherwise having people just post their scripts to a data archiving platform is probably at least as adequate (better for being persistent, if harder to browse etc).
>
> I applaud your proposal for the high-level description and detailed documentation of version dependencies, but I feel you gloss over the details far too much here. I think many researchers simply wouldn't know just what to put in these sections. Knowing exactly what dependencies to document and what not to document isn't obvious to anyone, and I believe we could benefit from more discussion as a field about this topic. For instance, many papers seem to feel it is sufficient to say "analyses were performed in R (cite R core team)". Or "in R 3.1.2". Clearly this is too little, but where to draw the line isn't clear to me (do I need compiler details? hardware details? we know they can matter). Automated tools help here (like sessionInfo() in R) but of course nothing is guaranteed. If such lists were really sufficient to reproduce computational environments with adequate success and minimal effort, the tool-heavy creatures of virtual machines and Docker etc. would never have taken off in the first place. Figuring out what should be included here and why is not trivial. The situation is equally as complicated for purposes of credit as it is for purposes of reproducibility (e.g. if I use a package that is a thin wrapper around some other software, what do I cite?)

Yeah, I tend to agree with you here. If I was to point out a weakness with my proposal, this would be it.
I just really dislike the idea that researchers would need to learn yet another tool (e.g. Docker) just to document their environment, particularly when the community has yet to settle on *the* tool for the job. That's why I settled for a simple list of dependencies - it seemed the only reasonable option open to a "regular researcher" (i.e. someone without a strong programming background) right now. In hindsight, I should have discussed this issue/limitation somewhere in the essay.

> The other thing I didn't quite understand was the discussion of log files, mostly because I don't come from a community where we have anything quite like netCDF (or the underlying HDF5). Is the goal here capturing workflow that takes place outside of a scripted environment? (e.g. it sounds like you are referring to a collection of command-line utilities here rather than stuff run in a python script)? Or to capture the data used by the scripts? Or to capture the provenance of the data as it passes through various analyses? Perhaps your answer is "all of the above", but I remain a bit confused about what these log files are and what their analog would be outside of the netCDF context. It sounds a little like asking for the (time-stamped) log files of an interactive bash/python/R session rather than a script. More generally, I'm curious what if anything a minimal standard should ask for by way of provenance, or if the script and description of the computational environment is sufficient.

The log files keep a record of each step of the process/workflow, as it was entered at the command line. So thinking from a SWC perspective, you first learn how to use the command line. Then you learn how to program (e.g. with Python) and how to write scripts that can be executed from the command line. You then learn how to version those scripts using Git. Without realising it, you now know everything you need to know to do reproducible research.

As an example, let's say your workflow requires the execution of three different command line scripts (e.g. regrid_data.py, calc_average.py and plot_results.py). Your log file (of command line entries, which I'm calling figure1.met as an example) might look like this:

    $ cat figure1.met
    22 July 2014: python plot_results.py average_data.dat figure1.png (Git hash: f9663f4)
    21 July 2014: python calc_average.py regridded_data.dat average_data.dat --bounds 15E 70E --weighted (Git hash: b9443e5)
    20 July 2014: python regrid_data.py orig_data.dat regridded_data.dat --method conservative (Git hash: a9573e4)

You could document this workflow by supplying the associated Makefile, but that requires the reader to understand how makefiles work. Make isn't the only workflow management tool out there, so that's a little unfair. Instead, you just keep a record of the command line entries as you go along (as described in this lesson: http://damienirving.github.io/capstone-oceanography/). If you are using a self-describing file format like HDF5 or netCDF, you can put that record in the file metadata itself. Otherwise, you can create metadata files that have exactly the same name as your data files, but that end with the extension .met. Once you've arrived at the end of your workflow (i.e. figure1.png), you can simply provide figure1.met (i.e. your log file), a link to the associated code repo and a description of the environment the code was executed in and hey presto, you've just done reproducible research without any special tooling.
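You don't even have to copy the commands out of the terminal by hand. As a rough sketch (this isn't from the essay or the lesson, just one way you might automate it in Python), each script could append its own entry to the log before it exits. The function name write_met_entry and the file names here are made up for the example, and it assumes the script is being run from inside a git repository:

    import datetime
    import subprocess
    import sys

    def write_met_entry(metadata_file):
        """Append this script's command line entry (plus git hash) to a .met log file."""
        # Short hash of the most recent commit in the repo the script lives in
        git_hash = subprocess.check_output(
            ['git', 'rev-parse', '--short', 'HEAD']).decode().strip()
        timestamp = datetime.datetime.now().strftime('%d %B %Y')
        entry = '%s: python %s (Git hash: %s)\n' % (timestamp, ' '.join(sys.argv), git_hash)
        with open(metadata_file, 'a') as log:
            log.write(entry)

    if __name__ == '__main__':
        # e.g. called at the end of plot_results.py, after figure1.png has been written
        write_met_entry('figure1.met')

If every script in the workflow calls something like this just before it finishes, the log builds up automatically as you go, and the same string could just as easily be written into the global history attribute of a netCDF file instead of a separate .met file.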
I've always been a little confused about how this idea would play out for an R or MATLAB user (they don't seem to write command line programs??), but I've never really got a straight answer.

I would also say that my proposed minimum standard doesn't say that the log files have to contain a list of command line entries - that's just my suggested approach. They could equally be detailed README documents explaining how the workflow was run. Those would need to be manually created (and hence would be more time-consuming and would risk not containing enough info to exactly reproduce the workflow), but they would also comply with my minimum standard. (Which is perhaps something I should also have pointed out more clearly in the essay, but the word limit was very tight!)

> Anyway, thanks for your thoughts and apologies for the rambling message.

Thank you!

> Cheers,
> Carl
>
> On Wed, Jan 13, 2016 at 11:26 AM Damien Irving <[email protected]> wrote:
>
>> Hi Jan,
>>
>> I have the same concerns regarding many of the "tool heavy" solutions to reproducibility out there. Here's an essay I wrote recently proposing a solution that requires no special tooling:
>> http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-15-00010.1
>>
>> and here's the accompanying SWC lesson:
>> http://damienirving.github.io/capstone-oceanography/
>>
>> (I should say that SWC doesn't teach Docker or anything like that - the shell, command line programs and automation with Make lessons basically teach people everything they need to know to be reproducible. I basically wrote the capstone and essay just to make that more explicit, because sometimes I'm not sure that SWC participants realise that they've learned everything they need.)
>>
>> Cheers,
>> Damien
>>
>> On Thu, Jan 14, 2016 at 6:06 AM, Jan Kim <[email protected]> wrote:
>>
>>> Dear All,
>>>
>>> I'm preparing a talk centred on reproducible computing, and as this is a topic relevant to and valued by SWC I'd like to ask your opinion and comments about this.
>>>
>>> One approach I take is checking to what extent my work from 10 - 20 years ago is reproducible today, and (perhaps not surprisingly) I found that having used make, scripts and (relatively) well defined text formats turns out to be highly beneficial in this regard.
>>>
>>> This has led me to wonder about some of the tools that currently seem to be popular, including on this list, but to me appear unnecessarily fat / overloaded and as such to have uncertain prospects for long term reproducibility:
>>>
>>> * "notebook" systems, and iPython / jupyter in particular:
>>>   - Will the JSON format for saving notebooks be readable / executable in the long term?
>>>   - Are these even reproducible in a rigorous sense, considering that results can vary depending on the order of executing cells?
>>>
>>> * Virtual machines and the recent lightweight "containerising" systems (Docker, Conda): they're undoubtedly a blessing for reproducibility, but
>>>   - what are the long term prospects of executing their images / environments etc.?
>>>   - to what extent is their dependence on backing companies a reason for concern?
>>>
>>> I hope that comments on these are relevant / interesting to the SWC community, in addition to providing me with insights / inspiration, and that therefore posting this here is ok.
>>>
>>> If you have comments on reproducible scientific computing in general, I'm interested as well -- please respond by mailing list or personal reply.
>>>
>>> Best regards & thanks in advance, Jan
>>> --
>>> +- Jan T. Kim -------------------------------------------------------+
>>> | email: [email protected]                                |
>>> | WWW:   http://www.jtkim.dreamhosters.com/                           |
>>> *-----=< hierarchical systems are for files, not for humans >=-----*
>
> --
>
> http://carlboettiger.info
_______________________________________________
Discuss mailing list
[email protected]
http://lists.software-carpentry.org/mailman/listinfo/discuss_lists.software-carpentry.org
