Hi Elizabeth, thanks for your thoughts. I particularly like your idea of saving the plain Python code, rather than the notebook, as the basis of reproducibility, and looking at your criteria again, I find that the plain code meets all three better than the notebook.
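To make the export idea a bit more concrete, here is a minimal sketch (not how Jupyter's own export is implemented) that pulls the code cells out of a notebook's JSON using only the standard library. It assumes the nbformat 4 layout, i.e. a top-level "cells" list whose code cells carry "source" and "execution_count"; the function name notebook_to_script and the tiny in-memory notebook are made up for illustration.

```python
import json


def notebook_to_script(notebook, sort_by_execution=True):
    """Return the code cells of a parsed notebook as one plain Python script.

    `notebook` is the dict obtained from json.load() on an .ipynb file
    (nbformat 4 layout assumed). With sort_by_execution=True, cells are
    ordered by their recorded execution count and never-executed cells
    are dropped; otherwise document order is kept.
    """
    cells = [c for c in notebook.get("cells", []) if c.get("cell_type") == "code"]
    if sort_by_execution:
        executed = [c for c in cells if c.get("execution_count") is not None]
        cells = sorted(executed, key=lambda c: c["execution_count"])
    chunks = []
    for cell in cells:
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be a list of lines
            source = "".join(source)
        header = "# In[{}]:".format(cell.get("execution_count"))
        chunks.append(header + "\n" + source.rstrip())
    return "\n\n".join(chunks) + "\n"


# Hypothetical notebook whose cells were executed out of document order.
example = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "code", "execution_count": 2, "source": ["y = x + 1\n"]},
        {"cell_type": "markdown", "source": ["Some notes"]},
        {"cell_type": "code", "execution_count": 1, "source": ["x = 41\n"]},
        {"cell_type": "code", "execution_count": None, "source": ["never_run()\n"]},
    ],
}

script = notebook_to_script(example)
print(script)

# The exported script runs without Jupyter installed.
namespace = {}
exec(script, namespace)

# For a real file (the file name is an assumption):
# with open("analysis.ipynb") as f:
#     script = notebook_to_script(json.load(f))
```

Two caveats: the execution count only records the most recent run of each cell, so cells that were re-executed or deleted leave no trace, and any IPython magics in the cells would have to be removed by hand before the script is valid plain Python.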
Some more detailed comments inline below.

On Wed, Jan 13, 2016 at 01:54:24PM -0600, E.W. wrote:
> Hi Jan,
>
> I love these kinds of questions. The library world has been trying to
> tackle many of these things for years.
>
> I'd like to unpack some of the discussion here a little to perhaps help
> facilitate discussion.
>
> Taking a look at your Jupyter question, there are three questions that I
> see:
>
> 1. Can the file be opened?
>
> * These are plain text files, so this is an ideal way to be stored.
>
> 2. Can the file be read?
>
> * The question is how long will JSON parsers be around and supported.
> Should be safe into the medium-term future. Even if it goes away in the
> long term, it wouldn't be impossible to write your own JSON parser to
> recover things. JSON is a well documented and open standard format, so it
> is a great long-term storage solution for this type of data.

Yes, JSON is simple enough to assume that someone in the future investigating it as an "archaeologist" could knock up a parser fairly quickly and easily. However, the flipside of JSON's simplicity is that this would then almost inevitably be followed by a session of reverse engineering / educated guessing at what the individual fields mean.

From a perspective of human (rather than machine) readability it's also a bit less than ideal that notebook JSONs can contain binary blobs (e.g. images / graphs, these are stored in base64 or similar). On the other hand, these embedded images can be useful to the "archaeologist" for assessing whether their reproduction attempts are on the right track.

> 3. Can the programming contents be executed?
>
> * You bring up the execution order, which is a great question. The
> execution count metadata is saved, so theoretically the original execution
> order can be recovered. The other issue here is one of the various
> versions and dependencies within the creator's Python environment.
> The notebook metadata does store the python version that I use, but nothing
> about the versions of pandas or other packages that I import. Long term
> preservation of the executableness would require copies of all the
> executing software.
>
> Thus, we have a situation where the original file content is completely
> accessible and preservable into the far future. However, the question is
> if the desired preservation object is the script content, the output,
> and/or the act of execution. We certainly won't be able to verify the
> output results if we are unable to execute the code.
>
> I always export my Jupyter notebooks as flat .py files as my preservation
> copies when I'm packing up a project. This makes recovery on a computer
> without ipynb or Jupyter significantly less annoying. This also takes the
> github/nbviewer dependency out for human readable methods of viewing the
> files.

Again -- I very much like this approach. As a further step I'd consider "tidying up" the exported code to weed out any code that may have been executed e.g. for monitoring / debugging purposes only, so as to produce a minimal script that reproduces the result. This may need weighing against risks of inadvertently altering the result, though -- which could be addressed either by using the script (rather than the notebook) for generating the final results, or by committing the script immediately after exporting it, and then committing the tidied-up version.

Best regards, Jan

> Elizabeth
> @elliewix
>
> On Wed, Jan 13, 2016 at 1:06 PM, Jan Kim <[email protected]> wrote:
>
> > Dear All,
> >
> > I'm preparing a talk centred on reproducible computing, and as this is
> > a topic relevant and valued by SWC I'd like to ask your opinion and
> > comments about this.
> >
> > One approach I take is checking to which extent my work from 10 - 20
> > years ago is reproducible today, and (perhaps not surprisingly) I found
> > that having used make, scripts and (relatively) well defined text
> > formats turns out to be highly beneficial in this regard.
> >
> > This has led me to wonder about some of the tools that currently seem
> > to be popular, including on this list, but to me appear unnecessarily
> > fat / overloaded and as such to have an uncertain perspective for long
> > term reproducibility:
> >
> > * "notebook" systems, and iPython / jupyter in particular:
> >     - Will the JSON format for saving notebooks be readable /
> >       executable in the long term?
> >     - Are these even reproducible in a rigorous sense, considering
> >       that results can vary depending on the order of executing cells?
> >
> > * Virtual machines and the recent lightweight "containerising"
> >   systems (Docker, Conda): They're undoubtedly a blessing for
> >   reproducibility but
> >     - what are the long term perspectives of executing their images
> >       / environments etc.?
> >     - to which extent is their dependence on backing companies a
> >       reason for concern?
> >
> > I hope that comments on these are relevant / interesting to the SWC
> > community, in addition to providing me with insights / inspiration,
> > and that therefore posting this here is ok.
> >
> > If you have comments on reproducible scientific computing in general,
> > I'm interested as well -- please respond by mailing list or personal
> > reply.
> >
> > Best regards & thanks in advance, Jan
> > --
> > +- Jan T. Kim -------------------------------------------------------+
> > | email: [email protected] |
> > | WWW: http://www.jtkim.dreamhosters.com/ |
> > *-----=< hierarchical systems are for files, not for humans >=-----*
> >
> > _______________________________________________
> > Discuss mailing list
> > [email protected]
> >
> > http://lists.software-carpentry.org/mailman/listinfo/discuss_lists.software-carpentry.org
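
P.S. On the point that the notebook metadata records the Python version but not the versions of pandas or other imported packages: one possible stopgap is to write those versions out next to the exported script. The following is only a sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_versions and the package list are made up for illustration.

```python
import sys
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version (None if absent)."""
    versions = {"python": sys.version.split()[0]}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; record that fact explicitly.
            versions[name] = None
    return versions


# The package list is an assumption; use whatever the analysis imports.
for package, version in installed_versions(["pandas", "numpy"]).items():
    print("{}=={}".format(package, version))
```

Note that the lookup goes by distribution name, which can differ from the name used in the import statement (e.g. scikit-learn is imported as sklearn). This only documents the environment; it does not preserve the software itself.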
