I love the question! Konrad Hinsen has made similar comments several times on my blog - latest here (http://ivory.idyll.org/blog/2015-what-should-I-teach-about-Jupyter.html#comment-2364536007) - and I'm still stewing on an answer. But I'll try to give you my perspective here ;)

On Wed, Jan 13, 2016 at 07:06:10PM +0000, Jan Kim wrote:
> Dear All,
>
> I'm preparing a talk centred on reproducible computing, and as this is
> a topic relevant and valued by SWC I'd like to ask your opinion and
> comments about this.
>
> One approach I take is checking to which extent my work from 10 - 20
> years ago is reproducible today, and (perhaps not surprisingly) I found
> that having used make, scripts and (relatively) well defined text
> formats turns out to be highly beneficial in this regard.
>
> This has led me to wonder about some of the tools that currently seem
> to be popular, including on this list, but to me appear unnecessarily
> fat / overloaded and as such to have an uncertain perspective for long
> term reproducibility:
[ ...]

I split the tools among several categories (a tool can fall into more than one) --

* what I'm running.
* where all the stuff specific to my project is stored.
* development & interaction & visualization.
* what I need to run it.

---

"What I'm running" is well covered by scripts and Makefiles, and I've been doing that for 15 years or more. (My first fully reproducible publication, bugs and all, is my 2004 paper!)

"Where my project is stored" has always been addressed by some version control system & more recently public sites, so GitHub now, SourceForge then. This just needs to be reasonably self-contained.

"Development" (including interaction and visualization) is the reason I'm so enthusiastic about Jupyter Notebooks (RStudio is also nice here, especially in its server form). I spend a tremendous amount of time poking around data, trying to understand it; Jupyter is really the first graphical interface I've found equally suitable for use on local and remote servers. (I used to use X win, blech, and then settled primarily on shell & editor - until Jupyter came along).

Konrad makes an excellent point (link above) that this kind of thing is *unnecessary* for reproducibility, and may even get in the way.
I mostly use Jupyter and RStudio for generating and exploring visualizations, in other words. That's not strictly important for publications, although I think it *could* be important for the review process. (I think it's better to use libraries & scripts for non-viz data analysis stuff, in general.)

"What I need to run it" has been much, much more problematic over the 23 years I've been doing this stuff (pardon my gout ;). My code can unfortunately depend on all sorts of UNIX gobbledygook, down to specific (and recent) versions of gcc. Only with the advent of full virtualization (and now the cloud and Docker) have I found what I think is an acceptable solution. The specific execution environment isn't all that important, be it cloud, Docker or a VM; it's the idea of being able to *computationally* specify the environment that is important. And that is where Docker, in particular, excels.

On the flip side, I've found that I don't really need Docker, or VMs, in my own work - it's just when I'm conveying it to others that it's useful. And if you write it properly for one platform, it's pretty easy to get it working on the other platforms. So it's hard to care overmuch about the technology flavor of the month ;). I just think containers have a lot more legs than VMs and have a lot of potential for my own work in infrastructure building.

A few other random opinions --

I think the JSON format of the notebook is problematic (although it's understandable why they went that way). The RMarkdown format is kinda nice and simple, and easily parseable.

There are field-specific considerations. Docker & containers more generally are going to be particularly important in areas where there's a lot of fast-moving technology; I cannot always get two different components of a pipeline to compile on the same !#%!! server.

Docker and (I believe) Conda are at least open source. I'll take them over all the other crap that's out there ;).
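
As a rough illustration of the notebook-format point above (a sketch only, not anything taken from the Jupyter tooling): getting the code back out of a notebook means parsing JSON and walking the cell list, whereas an RMarkdown file is already plain text you can read or grep. The function name `notebook_code_cells` and the tiny hand-written notebook are made up for this example; the only thing assumed about the format is the nbformat 4 layout, with a top-level "cells" list whose entries carry "cell_type" and "source".

```python
import json


def notebook_code_cells(notebook_json):
    """Return the source text of each code cell in an nbformat-4 style notebook.

    Hypothetical helper for illustration; it uses only the standard library.
    """
    notebook = json.loads(notebook_json)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # "source" may be stored as a single string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# A hand-written stand-in for a real .ipynb file (assumption: nbformat 4 layout).
EXAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Exploring the data\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["x = 1 + 1\n", "print(x)\n"]},
    ],
})

if __name__ == "__main__":
    for cell_source in notebook_code_cells(EXAMPLE_NOTEBOOK):
        print(cell_source, end="")
```

Not hard, but it is one more layer between you and your code, and diffs of the raw JSON in version control are a lot noisier than diffs of a text file.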

I also don't see a conflict between teaching SWC and using Jupyter as a specific teaching mechanism, as long as what you're teaching is not just a Jupyter-specific "here's the magic button you press and it all works".

cheers,
--titus
--
C. Titus Brown, [email protected]
