Damien,

Thanks for sharing that paper: a really excellent read, and I generally agree with all your points.
A few questions for you.

I'm a bit confused about the role that GitHub, or version control in general, plays in this vision. You seem to imply that hallmark examples of reproducibility create a "pristine repo" that is "just for the publication", while consoling the reader that using their "everyday repo for the project" is "good enough". I would have thought that a "pristine" repo somewhat defeats the purpose of version control in the first place, and that the incentive for this model was largely driven by individuals who either didn't have an "everyday GitHub repo" and were merely posting their code on GitHub at the time of publication because they heard that was the right thing to do (this is becoming increasingly common in my field), or who didn't want to expose the whole version history of mistakes and dead ends leading up to the final version. For the purposes of your paper I agree that either approach is fine, but I see no reason to hold the "pristine repo" up as a gold standard. (Sure, curated commit histories are easier to read than a long string of merge commits and the like, but that's beyond the scope here.)

More generally, I was glad to see your author guidelines didn't actually mention version control at all. For people already using it, sure, it's an easy way to distribute code. But otherwise, having people simply post their scripts to a data-archiving platform is probably at least as adequate (better for persistence, if harder to browse).

I applaud your proposal for a high-level description and detailed documentation of version dependencies, but I feel you gloss over the details far too much here. I think many researchers simply wouldn't know what to put in these sections. Knowing exactly which dependencies to document, and which not to, isn't obvious to anyone, and I believe we could benefit from more discussion as a field on this topic. For instance, many papers seem to feel it is sufficient to say "analyses were performed in R (cite R Core Team)", or "in R 3.1.2". Clearly this is too little, but where to draw the line isn't clear to me (do I need compiler details? hardware details? we know they can matter). Automated tools help here, like sessionInfo() in R (see the sketch below), but of course nothing is a guarantee: if such lists were really sufficient to reproduce computational environments with adequate success and minimal effort, tool-heavy approaches such as virtual machines and Docker would never have taken off in the first place. Figuring out what should be included here, and why, is not trivial. And the situation is just as complicated for the purposes of credit as it is for reproducibility (e.g. if I use a package that is a thin wrapper around some other software, what do I cite?).
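To make the sessionInfo() point concrete, here is roughly the kind of automated capture I have in mind. This is a minimal sketch, not a prescription; the output file name is just illustrative:

    # At the end of an analysis script, record the computational
    # environment alongside the results. sessionInfo() reports the R
    # version, the platform/OS, and the versions of all loaded
    # packages; capture.output() turns its printed form into text.
    # ("environment.txt" is just an illustrative file name.)
    writeLines(capture.output(sessionInfo()), "environment.txt")

Even this stops short of compiler and hardware details, which is exactly the line-drawing problem I mean.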
The other thing I didn't quite understand was the discussion of log files, mostly because I don't come from a community that has anything quite like netCDF (or the underlying HDF5). Is the goal here to capture workflow that takes place outside of a scripted environment (it sounds like you are referring to a collection of command-line utilities rather than code run in a Python script)? Or to capture the data used by the scripts? Or to capture the provenance of the data as it passes through various analyses? Perhaps your answer is "all of the above", but I remain a bit confused about what these log files are and what their analog would be outside of the netCDF context. It sounds a little like asking for the (time-stamped) log files of an interactive bash/python/R session, rather than a script.
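If that reading is right, here is my guess at a plain-text analog outside the netCDF context. A hedged sketch in R: log_step() is a made-up helper, and the file names are hypothetical. (In your case, I gather the netCDF history attribute plays this role automatically.)

    # Hypothetical helper: append one time-stamped line to a running
    # provenance log each time a processing step transforms a file,
    # recording what was run, on what input, producing what output.
    log_step <- function(logfile, command, infile, outfile) {
      entry <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                     command, infile, "->", outfile)
      cat(entry, "\n", file = logfile, append = TRUE)
    }

    # e.g., after a (hypothetical) smoothing step:
    # log_step("provenance.log", "Rscript smooth.R --window 5",
    #          "raw_data.csv", "smoothed_data.csv")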
More generally, I'm curious what, if anything, a minimal standard should ask for by way of provenance, or whether the script plus a description of the computational environment is sufficient.

Anyway, thanks for your thoughts, and apologies for the rambling message.

Cheers,
Carl

On Wed, Jan 13, 2016 at 11:26 AM Damien Irving <[email protected]> wrote:

> Hi Jan,
>
> I have the same concerns regarding many of the "tool heavy" solutions to
> reproducibility out there. Here's an essay I wrote recently proposing a
> solution that requires no special tooling:
> http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-15-00010.1
>
> and here's the accompanying SWC lesson:
> http://damienirving.github.io/capstone-oceanography/
>
> (I should say that SWC doesn't teach Docker or anything like that - the
> shell, command line programs and automation with Make lessons basically
> teach people everything they need to know to be reproducible. I basically
> wrote the capstone and essay just to make that more explicit, because
> sometimes I'm not sure that SWC participants realise that they've learned
> everything they need).
>
> Cheers,
> Damien
>
> <https://github.com/DamienIrving/CV/blob/master/CV.md>
>
> On Thu, Jan 14, 2016 at 6:06 AM, Jan Kim <[email protected]> wrote:
>
>> Dear All,
>>
>> I'm preparing a talk centred on reproducible computing, and as this is
>> a topic relevant to and valued by SWC, I'd like to ask your opinion and
>> comments about it.
>>
>> One approach I take is checking to what extent my work from 10-20
>> years ago is reproducible today, and (perhaps not surprisingly) I found
>> that having used make, scripts and (relatively) well-defined text
>> formats turns out to be highly beneficial in this regard.
>>
>> This has led me to wonder about some of the tools that currently seem
>> to be popular, including on this list, but to me appear unnecessarily
>> fat / overloaded and as such to have an uncertain outlook for long-term
>> reproducibility:
>>
>> * "notebook" systems, and IPython / Jupyter in particular:
>>   - Will the JSON format for saving notebooks be readable /
>>     executable in the long term?
>>   - Are these even reproducible in a rigorous sense, considering
>>     that results can vary depending on the order of executing cells?
>>
>> * Virtual machines and the recent lightweight "containerising"
>>   systems (Docker, Conda): they're undoubtedly a blessing for
>>   reproducibility, but
>>   - what are the long-term prospects of executing their images /
>>     environments etc.?
>>   - to what extent is their dependence on backing companies a
>>     reason for concern?
>>
>> I hope that comments on these are relevant / interesting to the SWC
>> community, in addition to providing me with insights / inspiration,
>> and that posting this here is therefore ok.
>>
>> If you have comments on reproducible scientific computing in general,
>> I'm interested as well -- please respond by mailing list or personal
>> reply.
>>
>> Best regards & thanks in advance,
>> Jan
>> --
>> +- Jan T. Kim ------------------------------------------------------+
>> | email: [email protected]                                    |
>> | WWW:   http://www.jtkim.dreamhosters.com/                          |
>> *-----=< hierarchical systems are for files, not for humans >=-----*

--
http://carlboettiger.info
