To summarize some further discussions we had on this, with the bottom line that it needs more thought:

The proposal amounts to establishing a third generation of media used for academic publishing:
1. Printed paper (since 1665)
2. Portable Document Format (PDF) files (since the 1990s)
3. Executable documents that contain data, code and text

While there are obvious small-scale solutions for 3., incl. those sketched by Martin and me, doing this well has similar requirements and aspirations for scalability, scope, durability, and time-unlimited support as we take for granted for 1.+2. There are millions of papers published per year across many disciplines of science.

Besides the technical challenges there are economic and organizational ones. The publishing industry should also have a role to play, although of course this is a fluid area.

There are relevant existing efforts, incl. this incomplete list:

Binders (documents with containers):
https://mybinder.readthedocs.io/en/latest/examples.html

Jupyter/RStudio interfaces to published datasets and results:
https://wholetale.org/index.html

The Pachyderm framework for running pipelines on archived data, with provenance tracking:
http://www.pachyderm.io/

Code Ocean:
https://codeocean.com/


---
Thanks to Martin Morgan and Michael Lawrence for input.


5.11.18 23:17, Martin Morgan scripsit:
This is a continuation of the discussion at

https://support.bioconductor.org/p/114814/#114824

Where Wolfgang asks about "creating a corner in the Bioconductor package ecosystem 
for packages that are only ever supposed to build and check with a single release"

I think this would be quite challenging to implement correctly, for instance 
ensuring that the user of an appropriate version of R can easily install the 
intended dependencies, and what exactly it means for a package to be restricted 
to a single release, e.g., CRAN packages are updated without versioned releases 
[I mean, a user of Bioc 3.7 will get the current version of the CRAN package, 
not the version that was available at the (beginning or end) of the 3.7 
release], so presumably the idea is that there is a snapshot of package 
versions that one requires. This part sounds very much like a job for packrat / 
switchr etc. Maybe 'our' job is to ensure that the appropriate information is 
discoverable?
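
A minimal sketch of what recording such a snapshot could look like, assuming only base R. It is not the packrat or switchr interface, and the file name is a hypothetical choice for illustration. It writes the name and version of every installed package to a small manifest that could be kept alongside a release.

```r
# Sketch only: record installed package versions in a plain-text manifest.
# 'manifest_file' is a hypothetical name; nothing here is packrat/switchr API.
manifest_file <- file.path(tempdir(), "package-manifest.csv")

ip <- installed.packages()
manifest <- data.frame(
  Package = unname(ip[, "Package"]),
  Version = unname(ip[, "Version"]),
  stringsAsFactors = FALSE
)
write.csv(manifest, manifest_file, row.names = FALSE)

# Read versions back as character so that e.g. "1.10" is not turned into 1.1.
recorded <- read.csv(manifest_file, colClasses = "character")
```

A manifest like this only makes the versions discoverable; actually reinstalling those exact versions is the harder part that packrat / switchr aim to address.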

I took as an example the defunct package BioMedR. Our friend Google ("Bioconductor 
BioMedR") took me to the last-known-good landing page (initially by way of a mirror 
in Japan...). The DOI on the (bioconductor.org version) of that page took me to the 
'Removed packages' ( https://bioconductor.org/about/removed-packages/ ) page, which again 
points to the last-known-good page. Likewise https://bioconductor.org/packages/BioMedR . 
The 'In bioc since' tag on the 'last-known-good' page allowed me to find the version of 
Bioconductor where the package was introduced. With some work I can find the AMI 
(https://bioconductor.org/help/bioconductor-cloud-ami/ ) and docker images 
(https://hub.docker.com/r/bioconductor/release_base2/tags/ ) for that release of 
Bioconductor; neither of these would be sufficient for reproducibility (I could get 
relevant Bioconductor package versions simply installing the package from our archive via 
BiocInstaller / BiocManager, but R packages would be more challenging). The package has a 
(impressively extensive!) vignette, but the vignette does not include sessionInfo() so 
one has to do considerable extra work to find the relevant packages. Again maybe packrat 
/ switchr help with this...
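
For illustration, the missing piece could be as small as a closing chunk in the vignette. This is a sketch using base R only; the output file name is a hypothetical choice, not something the package or Bioconductor prescribes.

```r
# Last chunk of a vignette: print the R version and the attached and
# loaded packages with their versions.
sessionInfo()

# Optionally keep the same information as a text file.
# 'info_file' is a hypothetical name chosen for this sketch.
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), info_file)
```

With this in place, a reader of the built vignette can see which package versions were in use without reconstructing them from the release history.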

I think 'incoming' versions of such packages would go through the usual review 
process, in an attempt to hew to some sort of overall Bioconductor standard of 
quality; the return on this investment would be limited by the short intended 
shelf-life of the package. These packages often have unique considerations, 
too, e.g., 'large' data and long build times, maintainer concerns about when 
the package is released relative to publication, etc. Also of interest would be 
commitment to the actual data storage and transfer costs and to the management 
costs of this type of package, coupled with appropriate consideration on scope 
of the repository (not just the Bioconductor cognoscenti, presumably) and 
advertising of availability e.g., via 
https://www.nature.com/sdata/policies/repositories .

Contemplating this type of package repository suggests a number of small items 
that provide 'cosmetic' improvements to the current situation (e.g., the 
removed-packages page could be organized in a tabular fashion to include from / 
to versions); a more meaningful attempt would probably require efforts to 
embrace packrat / switchr to avoid reinventing the reproducibility wheel, as 
well as commitment to reviewing and managing these packages for their long-term 
contribution. These are certainly noble goals and align with Bioconductor's 
emphasis on reproducibility; is this something that rises to the level of 
securing separate funding?

Martin
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
With thanks in advance-

Wolfgang

-------
Wolfgang Huber
Principal Investigator, EMBL Senior Scientist
European Molecular Biology Laboratory (EMBL)
Heidelberg, Germany

wolfgang.hu...@embl.de
http://www.huber.embl.de

My book with Susan Holmes: http://www.huber.embl.de/msmb





