Re: [Discuss] Reproducible computing -- some thoughts and questions

C. Titus Brown Wed, 13 Jan 2016 12:52:56 -0800

I love the question!  Konrad Hinsen has made similar comments several times on
my blog - latest here
(http://ivory.idyll.org/blog/2015-what-should-I-teach-about-Jupyter.html#comment-2364536007)
- and I'm still stewing on an answer.  But I'll try to give you my perspective
here ;)

On Wed, Jan 13, 2016 at 07:06:10PM +0000, Jan Kim wrote:
> Dear All,
> 
> I'm preparing a talk centred on reproducible computing, and as this is
> a topic relevant and valued by SWC I'd like to ask your opinion and
> comments about this.
> 
> One approach I take is checking to which extent my work from 10 - 20
> years ago is reproducible today, and (perhaps not surprisingly) I found
> that having used make, scripts and (relatively) well defined text
> formats turns out to be highly beneficial in this regard.
> 
> This has led me to wonder about some of the tools that currently seem
> to be popular, including on this list, but to me appear unnecessarily
> fat / overloaded and as such to have an uncertain perspective for long
> term reproducibility: [ ...]

I split the tools into several categories (a single tool can fall into more
than one) --

* what I'm running.

* where all the stuff specific to my project is stored.

* development & interaction & visualization.

* what I need to run it.

---

"What I'm running" is well covered by scripts and Makefiles, and I've been
doing that for 15 years or more.  (My first fully reproducible publication,
bugs and all, is my 2004 paper!)
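
As an illustration of the Makefile half of that: the basic rule make applies is
that a target gets rebuilt when it is missing or older than one of its inputs.
Below is a minimal Python sketch of just that rule. The function name
`needs_rebuild` and the file names in the usage note are hypothetical, and real
make does far more (dependency graphs, pattern rules, phony targets).

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any file in `sources`.

    This mirrors only make's basic timestamp comparison; it is an
    illustrative sketch, not a replacement for make.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any input modified after the target was written makes the target stale.
    return any(os.path.getmtime(src) > target_mtime for src in sources)
```

A driver script could call something like
`needs_rebuild("figure1.pdf", ["counts.csv", "plot.py"])` before re-running an
expensive step, which is roughly what a two-line Makefile rule gives you for
free.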

"Where my project is stored" has always been addressed by some version
control system & more recently public sites, so GitHub now, SourceForge
then.  This just needs to be reasonably self-contained.

"Development" (including interaction and visualization) is the reason I'm so
enthusiastic about Jupyter Notebooks (RStudio is also nice here, especially in
its server form).  I spend a tremendous amount of time poking around data,
trying to understand it; Jupyter is really the first graphical interface I've
found equally suitable for use on local and remote servers.  (I used to use X
win, blech, and then settled primarily on shell & editor - until Jupyter came
along).  Konrad makes an excellent point (link above) that this kind of thing
is *unnecessary* for reproducibility, and may even get in the way.

I mostly use Jupyter and RStudio for generating and exploring visualizations,
in other words.  That's not strictly important for publications, although I
think it *could* be important for the review process. (I think it's better to
use libraries & scripts for non-viz data analysis stuff, in general.)

"What I need to run it" has been much, much more problematic over the 23 years
I've been doing this stuff (pardon my gout ;).  My code can unfortunately
depend on all sorts of UNIX gobbledygook, down to specific (and recent)
versions of gcc. Only with the advent of full virtualization (and now the cloud
and Docker) have I found what I think is an acceptable solution.  The specific
execution environment isn't all that important, be it cloud, Docker or a VM;
it's the idea of being able to *computationally* specify the environment that
is important.  And that is where Docker, in particular, excels.
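
To give a feel for what "computationally specifying the environment" means,
here is a small Python sketch that writes out a machine-readable description of
the interpreter, platform and installed packages. It is only an assumption-laden
illustration: the function name `describe_environment` is made up, it covers
Python packages only (not gcc or system libraries), and it merely *records* an
environment, whereas a Dockerfile or VM image can actually rebuild one.

```python
import json
import platform
import sys
from importlib import metadata  # standard library since Python 3.8


def describe_environment():
    """Collect a JSON-serializable snapshot of the current Python environment."""
    packages = sorted(
        (dist.metadata["Name"], dist.version)
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    )
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": [f"{name}=={version}" for name, version in packages],
    }


if __name__ == "__main__":
    # Print the snapshot so it can be committed alongside the analysis scripts.
    json.dump(describe_environment(), sys.stdout, indent=2)
```

Checking a file like this into the project repository at least tells a later
reader what was installed when the results were produced, even if they have to
recreate it by hand.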

On the flip side, I've found that I don't really need Docker, or VMs, in
my own work - it's just when I'm conveying it to others that it's useful.
And if you write it properly for one platform, it's pretty easy to get it
working on the other platforms. So it's hard to care overmuch about
the technology flavor of the month ;).  I just think containers have a lot
more legs than VMs and have a lot of potential for my own work in
infrastructure building.

A few other random opinions --

I think the JSON format of the notebook is problematic (although
it's understandable why they went that way).  The RMarkdown format is kinda
nice and simple, and easily parseable.
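
To illustrate the point about the JSON format: getting the plain code back out
of a notebook takes a JSON parser rather than a text editor or grep. A minimal
sketch, assuming the version 4 notebook layout (a top-level "cells" list whose
entries have "cell_type" and "source" fields); the function name `code_cells`
is hypothetical.

```python
import json


def code_cells(notebook_text):
    """Return the source of each code cell in a notebook, as a list of strings.

    Assumes the nbformat 4 layout; older notebook versions nest cells
    differently and are not handled here.
    """
    nb = json.loads(notebook_text)
    cells = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        cells.append(source)
    return cells
```

With RMarkdown the equivalent is just reading the text between the code-chunk
fences, which is why the format is easier to diff and to parse by hand.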

There are field specific considerations.  Docker & containers more generally
are going to be particularly important in areas where there's a lot of
fast-moving technology; I cannot always get two different components of a
pipeline to compile on the same !#%!! server.

Docker and (I believe) Conda are at least open source.  I'll take them over
all the other crap that's out there ;).

I also don't see a conflict between teaching SWC and using Jupyter as a
specific teaching mechanism, as long as what you're teaching is not just a
Jupyter-specific "here's the magic button you press and it all works".

cheers,
--titus
-- 
C. Titus Brown, [email protected]

_______________________________________________
Discuss mailing list
[email protected]
http://lists.software-carpentry.org/mailman/listinfo/discuss_lists.software-carpentry.org
