To summarize some further discussions we had on this, with the bottom
line that it needs more thought:
The proposal amounts to establishing a third generation of media used
for academic publishing:
1. Printed paper (since 1665)
2. Portable document format (PDF) files (since 1990s)
3. Executable documents that contain data, code and text
While there are obvious small-scale solutions for 3., incl. those
sketched by Martin and me, doing this well has similar requirements and
aspirations for scalability, scope, durability, and time-unlimited
support as we take for granted for 1.+2. There are millions of papers
published per year across many disciplines of science.
Besides the technical challenges there are economic and organizational
ones. The publishing industry should also have a role to play, although
of course this is a fluid area.
There are relevant existing efforts, incl. this incomplete list:
Binders (documents with containers):
https://mybinder.readthedocs.io/en/latest/examples.html
Jupyter/RStudio interfaces to published datasets and results:
https://wholetale.org/index.html
The Pachyderm framework for running pipelines on archived data, with
provenance tracking:
http://www.pachyderm.io/
Code Ocean:
https://codeocean.com/
---
Thanks to Martin Morgan and Michael Lawrence for input.
5.11.18 23:17, Martin Morgan wrote:
This is a continuation of the discussion at
https://support.bioconductor.org/p/114814/#114824
where Wolfgang asks about "creating a corner in the Bioconductor package ecosystem
for packages that are only ever supposed to build and check with a single release".
I think this would be quite challenging to implement correctly. For instance, one
has to ensure that the user of an appropriate version of R can easily install the
intended dependencies, and to decide what exactly it means for a package to be
restricted to a single release. CRAN packages are updated without versioned
releases (a user of Bioc 3.7 will get the current version of a CRAN package,
not the version that was available at the beginning or end of the 3.7
release), so presumably the idea is that there is a snapshot of the package
versions that one requires. This part sounds as much like a job for packrat /
switchr etc. Maybe 'our' job is to ensure that the appropriate information is
discoverable?
I took as an example the defunct package BioMedR. Our friend Google ("Bioconductor
BioMedR") took me to the last-known-good landing page (initially by way of a mirror
in Japan...). The DOI on the bioconductor.org version of that page took me to the
'Removed packages' page ( https://bioconductor.org/about/removed-packages/ ), which again
points to the last-known-good page, as does https://bioconductor.org/packages/BioMedR .
The 'In bioc since' tag on the last-known-good page allowed me to find the version of
Bioconductor where the package was introduced. With some work I can find the AMI
( https://bioconductor.org/help/bioconductor-cloud-ami/ ) and Docker images
( https://hub.docker.com/r/bioconductor/release_base2/tags/ ) for that release of
Bioconductor; neither of these would be sufficient for reproducibility (I could get the
relevant Bioconductor package versions simply by installing the package from our archive
via BiocInstaller / BiocManager, but R packages would be more challenging). The package
has an (impressively extensive!) vignette, but the vignette does not include
sessionInfo(), so one has to do considerable extra work to find the relevant packages.
Again, maybe packrat / switchr help with this...
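
To illustrate why a recorded sessionInfo() matters, here is a minimal sketch (in
Python, purely for illustration) of turning saved sessionInfo() output into a
list of pinned package versions. It assumes the usual "name_version" tokens on
lines that start with an index such as "[1]"; the function name
parse_session_info and the package names in the excerpt are hypothetical
placeholders, not taken from any real vignette.

```python
import re

# Assumed token layout in sessionInfo() output: "name_version",
# for example "somePackage_1.2.0" or "somePackage_0.9-3".
TOKEN = re.compile(r"([A-Za-z][A-Za-z0-9.]*)_(\d+(?:[.-]\d+)*)")


def parse_session_info(text):
    """Return {package: version} for every versioned package listed."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        # Package lines start with an index such as "[1]"; skip all others
        # (this also skips the "Platform:" line, which contains "x86_64").
        if not fields or not re.fullmatch(r"\[\d+\]", fields[0]):
            continue
        for token in fields[1:]:
            match = TOKEN.fullmatch(token)
            # Locale entries and unversioned base packages do not match.
            if match:
                versions[match.group(1)] = match.group(2)
    return versions


# Hypothetical excerpt; package names and versions are placeholders.
EXAMPLE = """\
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8

attached base packages:
[1] stats     graphics  utils

other attached packages:
[1] examplePkgA_1.2.0 examplePkgB_0.9-3
"""

pinned = parse_session_info(EXAMPLE)
for name in sorted(pinned):
    print(f"{name}=={pinned[name]}")
```

A list like this only says which versions were used; obtaining those exact
versions again is the part that packrat / switchr or an archive would have to
provide.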
I think 'incoming' versions of such packages would go through the usual review
process, in an attempt to hew to some sort of overall Bioconductor standard of
quality; the return on this investment would be limited by the short intended
shelf-life of the package. These packages often have unique considerations,
too, e.g., 'large' data and long build times, maintainer concerns about when
the package is released relative to publication, etc. Also of interest would be
commitment to the actual data storage and transfer costs and to the management
costs of this type of package, coupled with appropriate consideration on scope
of the repository (not just the Bioconductor cognoscenti, presumably) and
advertising of availability e.g., via
https://www.nature.com/sdata/policies/repositories .
Contemplating this type of package repository suggests a number of small items
that provide 'cosmetic' improvements to the current situation (e.g., the
removed-packages page could be organized in a tabular fashion to include from /
to versions); a more meaningful attempt would probably require efforts to
embrace packrat / switchr to avoid reinventing the reproducibility wheel, as
well as commitment to reviewing and managing these packages for their long-term
contribution. These are certainly noble goals and align with Bioconductor's
emphasis on reproducibility; is this something that rises to the level of
securing separate funding?
Martin
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
With thanks in advance-
Wolfgang
-------
Wolfgang Huber
Principal Investigator, EMBL Senior Scientist
European Molecular Biology Laboratory (EMBL)
Heidelberg, Germany
wolfgang.hu...@embl.de
http://www.huber.embl.de
My book with Susan Holmes: http://www.huber.embl.de/msmb