Thanks, Levi, nice slides. In case it is a helpful perspective, I'll try to share what I recall of my thought process as author of phyloseq. And I should preface by admitting that I've been embarrassed by this major development oversight for some years now.
At the beginning of 2011 I was a new postdoc, heavy R user, completely new to R development, and in a "cousin" field (microbiome & bacterial genomics) that had virtually no presence in BioC. Some of the recommendations you've made in this slide deck were not available then, but admittedly, I might have missed them even if they were. I had access to training in base R devel (Hadley's bootcamp, John Chambers' course at Stanford), but a lot of resources in BioC were still very new to me. If you can believe it, my original idea for the phyloseq-class was even worse! Valerie Obenchain was great and patiently talked me into a better solution, but somehow she also missed that I might have re-used available classes and avoided some unnecessary implementation and maintenance.

What additional recommendations can I make with the benefit of hindsight that have not been mentioned in this thread?

- Hopefully it is obvious from my description, and also from what I imagine to be Levi's motivation for making the slide deck, but somehow new eager developers are missing out on this great infrastructure, and it isn't because they want to re-implement core stuff. I sure didn't! I simply didn't know what was there or what the best practices for BioC were. *A "beginner's guide to BioC package development"* would have been at the top of my list of things to read back then.
- It isn't that I didn't read other established packages. I did. However, a lot of core BioC tools had gene-expression-specific names even for data classes that were not intrinsically gene expression (e.g. it's actually a matrix, or related tables) -- and I'm happy many of these now use more general names like "experiment" or "row". The old names signaled to me "this isn't for you". And I naively, ignorantly, mostly accepted that at face value.
- Conversely, sometimes not-inheriting methods is a feature, because it protects users from doing something that is great and appropriate for one domain (gene expression) but totally irrational in another (microbiome). I'm not saying my original implementation made great nuanced decisions about this -- it has many trappings of a new developer -- but I did have some pretty naive users in mind with phyloseq, for whom navigating legacy methods and method names from other domain(s) was expected to be a hurdle. Curious to hear thoughts on this.
- There actually *still isn't core support for evolutionary trees in BioC* (as mentioned by Joe Paulson and Ben Callahan in other threads). One of phyloseq's key contributions was to leverage the fantastic representation of trees implemented in the CRAN package "ape" in order to support analysis techniques popular among microbiome researchers that require a phylogenetic tree. The integration in the phyloseq-class and ape is necessarily pretty deep, including certain row operations. Users also needed a familiar and simple R interface to manipulate that composite object despite the complex hierarchical relationship among rows. Correct me if I'm wrong, but I think there is still no core BioC support for representing tree-like or bio-taxonomy-like hierarchy among rows in a SummarizedExperiment, or equivalent; and consequently certain row operations may have to be modified more deeply than usual if we were to re-implement phyloseq "the right way". I'd love to hear thoughts on this.

Even though phyloseq is at the receiving end, I think the criticism is fair, and I want current and future new BioC contributors to not re-make my mistakes circa 2011-12. I'm happy to help if I can.

Cheers, and thanks for the interesting, collegial thread.

Joey

---
"Joey" Paul J. McMurdie II

On Wed, Oct 18, 2017 at 11:46 AM, Levi Waldron <lwaldron.resea...@gmail.com> wrote:

> On Wed, Oct 18, 2017 at 10:26 AM, Ryan Thompson <r...@thompsonclan.org> wrote:
>
>> I think the main reason for reusing/subclassing core classes that users can
>> appreciate is that it makes it much easier for users to integrate multiple
>> packages into a single workflow. Only the most basic of pipelines uses just
>> a single Bioconductor package. For instance, an "edgeR" pipeline obviously
>> uses the edgeR package, but it likely also uses several other packages,
>> like sva, RUV, variancePartition, etc. The more these different packages
>> operate on the same core data structures, the less work the user has to do
>> to use them together. And to bring that back around to an incentive for
>> developers, making your package interoperate with other packages more
>> easily means that users will be more likely to use your package.
>
> My impression is that the interoperability argument may already be more
> widely appreciated, because in the pipeline example you can have several
> packages operating on the same data class. It seems less obvious when you
> are doing something different that requires defining a new class, why you
> should extend an existing class to meet your needs. Although I guess your
> point extends to interoperability with other packages providing methods for
> the parent class, and the ability to use coercion methods defined for the
> parent class, which I didn't mention...

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel