Re: [Bioc-devel] Changes to the SummarizedExperiment Class
Oh, I don't disagree. Perhaps the two problems can be addressed simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to be useful
2) calling it something besides SummarizedExperiment, say, ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful) and progress could be sought in the offshoot (ExperimentCollection or whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA methylation CNV data (which can of course be called from the same assay) is Kasper JP's GenomicRatioSet, which already has some weird quirks for eSet backwards compatibility. (e.g. sampleNames(x) works, but sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls rowData(x)) There are little niggles that I should probably just send in a patch for, but a cleaner overall container would be better, if for no other reason than the aforementioned ability to easily experiment with imputation.

An approach that I've been using is to stuff the SNPs, CNV (as GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is... somewhat less than optimal, especially when subsetting. But it does suggest that I could define a coercion from the current rambling wreck into a nice clean new class/API (ExperimentCollection or whatever) and I'll bet other package authors could, too. The presence of a GRangesFrame would then be handy for returning a given assay's results, so that the user could be blissfully ignorant of the storage backing (ff, BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management advantages of a SummarizedExperiment.

JMHO

Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey st...@channing.harvard.edu wrote:

I am a bit concerned about any major alterations to the SummarizedExperiment API.
We have two papers and plenty of working code that use it in meaningful ways. Effort required to keep new formulations back-compatible as well as bug-free has to be weighed seriously. I agree that the name is not ideal. We are learning as we go. Seems to make sense to start with the contracts we want the instances of a class to satisfy. I have long felt that X[i, j] idiom is one users and developers should be comfortable with, even insist on, and for consistency with matrix operations idiom, it should work in a natural way for numeric indexing. This seems like an important constraint. subsetBy* is a useful idiom, but it is conceivable that we would adopt filter() for row-oriented selections and select() for column-oriented selections. Do we have to make any special design considerations to allow very smooth interoperation with out-of-memory resources for certain components for developers who want to allow this? We should have a reasonable way to get data on what is out there, what is used, how it is most effectively used. What's the SE API? Is it well-adapted to requirements of DESeq2? Other killer packages that use/don't use it? Even getting data on the formal API for a class is not all that familiar. And if folks are writing non-S4 interfaces (i.e., naked functions) we have no way of identifying them. See below for one way of discovering the API for SummarizedExperiment. In summary, I think we have to be careful about overdesigning too early. Getting clear on contracts seems the best way to ensure reuse, and we really want that so that reliability is continually assessed. My sense is that it is good to give developers something they'll gladly extend, not necessarily reuse directly. So we don't have to have broad consensus on class details, but on the minimal abstraction and on obligatory tests on its basic implementation. 
methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM
DataFrame with 76 rows and 3 columns
    generic       signature                                                          package
    <character>   <character>                                                        <character>
1   [             x=SummarizedExperiment, i=ANY, j=ANY, drop=ANY                     base
2   [             x=SummarizedExperiment, i=ANY, j=missing, value=ANY                base
3   [             x=SummarizedExperiment, i=ANY, j=missing                           base
4   [<-           x=SummarizedExperiment, i=ANY, j=ANY, value=SummarizedExperiment   base
5   assay         x=SummarizedExperiment, i=character                                GenomicRanges
... ...           ...                                                                ...
72  updateObject  object=SummarizedExperiment                                        BiocGenerics
73  values        x=SummarizedExperiment                                             S4Vectors
74  values<-      x=SummarizedExperiment                                             S4Vectors
75  width         x=SummarizedExperiment                                             BiocGenerics
76  width<-       x=SummarizedExperiment                                             BiocGenerics

On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo
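A minimal sketch of the API-discovery idea above, assuming R >= 3.2.0, where methods(class=) reports S4 as well as S3 methods. ToyExperiment and toyAssay are hypothetical names defined here only for illustration; they are not part of any Bioconductor package.

```r
library(methods)

# Hypothetical toy class and generic, defined only for this illustration.
setClass("ToyExperiment", representation(values = "matrix"))
setGeneric("toyAssay", function(x) standardGeneric("toyAssay"))
setMethod("toyAssay", "ToyExperiment", function(x) x@values)
setMethod("show", "ToyExperiment", function(object) {
  cat("ToyExperiment with", nrow(object@values), "rows\n")
})

# List the methods that dispatch on the class.
m <- methods(class = "ToyExperiment")
print(m)

# The "info" attribute is a data.frame describing each method:
# which generic it belongs to, whether it is S4, and where it was found.
info <- attr(m, "info")
print(info[, c("generic", "isS4", "from")])
```

Replacing "ToyExperiment" with "SummarizedExperiment" (after attaching the package that defines it) should give a listing of the same kind as the one quoted in this thread. Plain functions that are not registered as methods will not show up, which is the limitation noted above.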
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
I think we need to make sure that there are enough benefits of something like GRangesFrame before we introduce yet another complicated and overlapping data structure into the framework. Prior to summarization, the ranges seem primary, after summarization, it may often make sense for them to be secondary. But I'm just not sure what we gain from a new data structure. On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès hpa...@fredhutch.org wrote: GRangesFrame is an interesting idea and I gave it some thoughts. There is this nice symmetry between GRanges and GRangesFrame: - GRanges = a naked GRanges + a DataFrame accessible via mcols() - GRangesFrame = a DataFrame + a naked GRanges accessible via some accessor (e.g. rowRanges()) So GRanges and GRangesFrame are equivalent in terms of what they can hold, but different in terms of API: the former has the ranges API as primary API and the DataFrame API on its mcols() component, and the latter has the DataFrame API as primary API and the ranges API on its rowRanges() component. Nice switch! What does this API switch bring us? A GRangesFrame object is now an object that fully behaves like a DataFrame and people can also perform range-based operations on its rowRanges() component. Here is what I'm afraid is going to happen: people will also want to be able to perform range-based operations *directly* on these objects, i.e. without having to call rowRanges() first. So for example when they do subsetByOverlaps(), subsetting happens vertically. Also the Hits object returned by findOverlaps() would contain row indices. Problem with this is that these objects now start to suffer from the dual personality syndrome. For example, it's not clear anymore what their length should be. 
Strictly speaking it should be their number of columns (that's what the length of a DataFrame is), but the ranges API that we're trying to put on them also makes them feel like vectors along the vertical dimension so it also feels that their length should be their number of rows. Same thing with 1D subsetting. Why does it subset the columns and not the rows? Most people are now confused. It's interesting to note that the same thing happens with GRanges objects, but in the opposite direction: people wish they could do DataFrame operations directly on them without calling mcols() first. But in order to preserve the good health of GRanges objects, we've not done that (except for $, a shortcut for mcols(x)$, the pressure was just too strong). H. On 03/03/2015 04:35 PM, Michael Lawrence wrote: Should be possible for the annotations to be of any type, as long as they satisfy a simple contract of NROW() and 2D [. Then, you could have a DataFrame, GRanges, or whatever in there. But it would be nice to have a special class for the container with range information. The contract for the range annotation would be to have a granges() method. I agree it would be nice if there was a way with the methods package to easily assert such contracts. For example, one could define an interface with a set of generics (and optionally the relevant position in the generic signature). Then, once all of the methods have been assigned for a particular class, it is made to inherit from that contract class. There are lots of gotchas though. Not sure how useful it would be in practice. On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty haverty.pe...@gene.com wrote: There are some nice similarities in these new imaginary types. A GRangesFrame is a list of dimensionally identical things (columns) and some row meta-data (the GRanges). 
The SE-like object is similarly a list of dimensionally like things (matrices, RleDataFrames, BigMatrix objects, HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame). Elegant? Maybe they would actually be relatives in the class tree. I wonder if this kind of thing would be easier if we had Java-style Interfaces or duck-typing. The x slot of y holds something that implements this set of methods ... Oh, and kinda apropos, the genoset class will probably go away or become an extension to this new SE-like thing. The extra stuff that comes along with genoset will still be available. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. tim.tri...@gmail.com wrote: This. It would be damned near perfect as a return value for assays coming out of an object that held several such assays at several time points in a population, where there are both assay-wise and covariate-wise holes that could nonetheless be usefully imputed across assays. Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty haverty.pe...@gene.com wrote: I still think GRanges should be a subclass of DataFrame, which would make
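The length and 1D-subsetting ambiguity described above can be seen with today's classes. This is a small sketch, assuming the GenomicRanges and S4Vectors packages are installed; the data are made up for illustration.

```r
library(GenomicRanges)  # assumption: installed from Bioconductor
library(S4Vectors)

# A GRanges is a vector of ranges; its annotations live in mcols().
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(1, 10, 20), width = 5),
              score = c(0.1, 0.5, 0.9))
length(gr)   # counts ranges (the "rows")
gr[1]        # 1D subsetting selects ranges
mcols(gr)    # the DataFrame side of the object
gr$score     # the one shortcut mentioned above: same as mcols(gr)$score

# A DataFrame is a list of columns; length and 1D subsetting act on columns.
df <- DataFrame(score = c(0.1, 0.5, 0.9), label = c("a", "b", "c"))
length(df)   # counts columns
df[1]        # 1D subsetting selects columns
```

A GRangesFrame that exposed both APIs directly would have to pick one of these two meanings for length() and for x[i].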
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
What complexity? The Nature Methods paper laid it out: for most people, most of the time, use an SE. That way, the organization of metadata and covariates is enforced for you, like an ExpressionSet (another winning data structure) but without its baggage. Maybe the Summarized in the name isn't such a bad idea after all. AfterTheDataMungingIsDone doesn't have the same ring to it. What would be equally awesome IMHO is to have a similarly unifying structure for integrative work. But that's just, like, my opinion. I've taken a whack at it when I knew even less than I do now, and it's hard. However, data management for expression arrays was hard, too. If I'm not mistaken, there were benefits to solving that data management problem, too. Some sort of a software project. I think it was called MADMAN. I'll have to go look. ;-) Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty haverty.pe...@gene.com wrote: Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front. P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence lawrence.mich...@gene.com wrote: I think we need to make sure that there are enough benefits of something like GRangesFrame before we introduce yet another complicated and overlapping data structure into the framework. Prior to summarization, the ranges seem primary, after summarization, it may often make sense for them to be secondary. But I'm just not sure what we gain from a new data structure. 
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo robert.cast...@upf.edu wrote:

some of the goals behind this discussion are IMO similar to the ones for biocMultiAssay: https://github.com/vjcitn/biocMultiAssay maybe Vince can confirm.

It is true that there are connections between the concerns. But the way I see it, the container design we are talking about in this thread addresses the management of a fixed common assay type over a fixed set of samples. The biocMultiAssay deals with the management of multiple assay types over multiple samples, with possible disparities in sample sets over the different assay types.

robert.

On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:

Oh, I don't disagree. Perhaps the two problems can be addressed simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to be useful
2) calling it something besides SummarizedExperiment, say, ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful) and progress could be sought in the offshoot (ExperimentCollection or whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA methylation CNV data (which can of course be called from the same assay) is Kasper JP's GenomicRatioSet, which already has some weird quirks for eSet backwards compatibility. (e.g. sampleNames(x) works, but sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls rowData(x)) There are little niggles that I should probably just send in a patch for, but a cleaner overall container would be better, if for no other reason than the aforementioned ability to easily experiment with imputation.

An approach that I've been using is to stuff the SNPs, CNV (as GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is... somewhat less than optimal, especially when subsetting.
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front. P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned. Pete Peter M. Haverty, Ph.D. Genentech, Inc. phave...@gene.com On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence lawrence.mich...@gene.com wrote: I think we need to make sure that there are enough benefits of something like GRangesFrame before we introduce yet another complicated and overlapping data structure into the framework. Prior to summarization, the ranges seem primary, after summarization, it may often make sense for them to be secondary. But I'm just not sure what we gain from a new data structure. On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès hpa...@fredhutch.org wrote: GRangesFrame is an interesting idea and I gave it some thoughts. There is this nice symmetry between GRanges and GRangesFrame: - GRanges = a naked GRanges + a DataFrame accessible via mcols() - GRangesFrame = a DataFrame + a naked GRanges accessible via some accessor (e.g. rowRanges()) So GRanges and GRangesFrame are equivalent in terms of what they can hold, but different in terms of API: the former has the ranges API as primary API and the DataFrame API on its mcols() component, and the latter has the DataFrame API as primary API and the ranges API on its rowRanges() component. Nice switch! What does this API switch bring us? A GRangesFrame object is now an object that fully behaves like a DataFrame and people can also perform range-based operations on its rowRanges() component. Here is what I'm afraid is going to happen: people will also want to be able to perform range-based operations *directly* on these objects, i.e.
without having to call rowRanges() first. So for example when they do subsetByOverlaps(), subsetting happens vertically. Also the Hits object returned by findOverlaps() would contain row indices. Problem with this is that these objects now start to suffer from the dual personality syndrome. For example, it's not clear anymore what their length should be. Strictly speaking it should be their number of columns (that's what the length of a DataFrame is), but the ranges API that we're trying to put on them also makes them feel like vectors along the vertical dimension so it also feels that their length should be their number of rows. Same thing with 1D subsetting. Why does it subset the columns and not the rows? Most people are now confused. It's interesting to note that the same thing happens with GRanges objects, but in the opposite direction: people wish they could do DataFrame operations directly on them without calling mcols() first. But in order to preserve the good health of GRanges objects, we've not done that (except for $, a shortcut for mcols(x)$, the pressure was just too strong). H. On 03/03/2015 04:35 PM, Michael Lawrence wrote: Should be possible for the annotations to be of any type, as long as they satisfy a simple contract of NROW() and 2D [. Then, you could have a DataFrame, GRanges, or whatever in there. But it would be nice to have a special class for the container with range information. The contract for the range annotation would be to have a granges() method. I agree it would be nice if there was a way with the methods package to easily assert such contracts. For example, one could define an interface with a set of generics (and optionally the relevant position in the generic signature). Then, once all of the methods have been assigned for a particular class, it is made to inherit from that contract class. There are lots of gotchas though. Not sure how useful it would be in practice. 
On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty haverty.pe...@gene.com wrote: There are some nice similarities in these new imaginary types. A GRangesFrame is a list of dimensionally identical things (columns) and some row meta-data (the GRanges). The SE-like object is similarly a list of dimensionally like things (matrices, RleDataFrames, BigMatrix objects, HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame). Elegant? Maybe they would actually be relatives in the class tree. I wonder if this kind of thing would be easier if we had Java-style Interfaces or duck-typing. The x slot of y holds something that implements this set of methods ... Oh, and kinda apropos, the genoset class will probably go away or become an extension to this new SE-like thing. The extra stuff that comes along with genoset will still be available. Pete Peter M. Haverty, Ph.D. Genentech,
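The "interface as a set of generics" idea discussed above can be approximated with the methods package as it stands. What follows is only a sketch under stated assumptions: nrowOf, subset2D, ToyAnnotation and checkContract are hypothetical names invented for the example, and the check only asks whether a method is defined directly for the class.

```r
library(methods)

# Hypothetical generics standing in for the contract (think NROW() and 2D "[").
setGeneric("nrowOf", function(x) standardGeneric("nrowOf"))
setGeneric("subset2D", function(x, i, j) standardGeneric("subset2D"))

# Hypothetical annotation class that implements only part of the contract.
setClass("ToyAnnotation", representation(df = "data.frame"))
setMethod("nrowOf", "ToyAnnotation", function(x) nrow(x@df))

# Report, per generic, whether the class has a method defined for it.
checkContract <- function(className, generics) {
  vapply(generics,
         function(g) isGeneric(g) && hasMethod(g, className),
         logical(1))
}

status <- checkContract("ToyAnnotation", c("nrowOf", "subset2D"))
print(status)

# After the missing method is added, the contract is fully satisfied.
setMethod("subset2D", "ToyAnnotation",
          function(x, i, j) new("ToyAnnotation", df = x@df[i, j, drop = FALSE]))
print(all(checkContract("ToyAnnotation", c("nrowOf", "subset2D"))))
```

This only tests that methods exist, not that they behave correctly, which is one of the gotchas: a contract in the sense used in this thread would also need the obligatory tests on the basic implementation.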
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
My response was meant to address this: 1) fixed-dimension, fixed sample set is a solved problem, and SE is that solution. 2) multi-assay, holes across samples remains an ugly thorny problem, maybe needs a new API So why not keep SE as stable as possible, and dump all the explosive changes into the latter? Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey st...@channing.harvard.edu wrote: On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo robert.cast...@upf.edu wrote: some of the goals behind this discussion are IMO similar to the ones for biocMultiAssay: https://github.com/vjcitn/biocMultiAssay maybe Vince can confirm. It is true that there are connections between the concerns But the way I see it, the container design we are talking about in this thread addresses the management of a fixed common assay type over a fixed set of samples. The biocMultiAssay deals with the management of multiple assay types over multiple samples, with possible disparities in sample sets over the different assay types. robert. On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote: Oh, I don't disagree. Perhaps the two problems can be addressed simultaneously by 1) deciding on what contracts a multi-assay container can/would demand to be useful 2) calling it something besides SummarizedExperiment, say, ExperimentCollection Then the SE API could stay the same as it is (which is already very useful) and progress could be sought in the offshoot (ExperimentCollection or whatever) without breaking things that rely on SE. Just off the top of my head, a most generically useful container for DNA methylation CNV data (which can of course be called from the same assay) is Kasper JP's GenomicRatioSet, which already has some weird quirks for eSet backwards compatibility. (e.g. 
sampleNames(x) works, but sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls rowData(x)) There are little niggles that I should probably just send in a patch for, but a cleaner overall container would be better, if for no other reason than the aforementioned ability to easily experiment with imputation. An approach that I've been using is to stuff the SNPs, CNV (as GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is... somewhat less than optimal, especially when subsetting. But it does suggest that I could define a coercion from the current rambling wreck into a nice clean new class/API (ExperimentCollection or whatever) and I'll bet other package authors could, too. The presence of a GRangesFrame would then be handy for returning a given assay's results, so that the user could be blissfully ignorant of the storage backing (ff, BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management advantages of a SummarizedExperiment. JMHO Statistics is the grammar of science. Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey st...@channing.harvard.edu wrote: I am a bit concerned about any major alterations to the SummarizedExperiment API. We have two papers and plenty of working code that use it in meaningful ways. Effort required to keep new formulations back-compatible as well as bug-free has to be weighed seriously. I agree that the name is not ideal. We are learning as we go. Seems to make sense to start with the contracts we want the instances of a class to satisfy. I have long felt that X[i, j] idiom is one users and developers should be comfortable with, even insist on, and for consistency with matrix operations idiom, it should work in a natural way for numeric indexing. This seems like an important constraint.
subsetBy* is a useful idiom, but it is conceivable that we would adopt filter() for row-oriented selections and select() for column-oriented selections. Do we have to make any special design considerations to allow very smooth interoperation with out-of-memory resources for certain components for developers who want to allow this? We should have a reasonable way to get data on what is out there, what is used, how it is most effectively used. What's the SE API? Is it well-adapted to requirements of DESeq2? Other killer packages that use/don't use it? Even getting data on the formal API for a class is not all that familiar. And if folks are writing non-S4 interfaces (i.e., naked functions) we have no way of identifying them. See below for one way of discovering the API for SummarizedExperiment. In summary, I think we have to be careful about overdesigning too early. Getting clear on contracts seems the best way to ensure reuse, and we really want that so that reliability is continually assessed. My sense is that it is good to give developers something they'll gladly extend, not necessarily reuse directly.
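The X[i, j] contract mentioned above can be illustrated with a small sketch. It assumes the present-day SummarizedExperiment package is installed (at the time of this thread the class was defined in GenomicRanges); the data and names are made up.

```r
library(SummarizedExperiment)  # assumption: current Bioconductor package layout

# Toy data: 4 features x 3 samples (all names are made up for illustration).
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))
se <- SummarizedExperiment(
  assays  = list(counts = counts),
  colData = DataFrame(group = c("a", "b", "a"), row.names = colnames(counts))
)

# Numeric [i, j] indexing: i selects features, j selects samples,
# and the sample annotation is subset together with the assay.
sub <- se[2:3, c(1, 3)]
dim(sub)
colData(sub)$group
assay(sub, "counts")
```

The point of the contract is that the same indices applied to the plain matrix, counts[2:3, c(1, 3)], select the same values.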
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
some of the goals behind this discussion are IMO similar to the ones for biocMultiAssay: https://github.com/vjcitn/biocMultiAssay maybe Vince can confirm. robert. On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote: Oh, I don't disagree. Perhaps the two problems can be addressed simultaneously by 1) deciding on what contracts a multi-assay container can/would demand to be useful 2) calling it something besides SummarizedExperiment, say, ExperimentCollection Then the SE API could stay the same as it is (which is already very useful) and progress could be sought in the offshoot (ExperimentCollection or whatever) without breaking things that rely on SE. Just off the top of my head, a most generically useful container for DNA methylation CNV data (which can of course be called from the same assay) is Kasper JP's GenomicRatioSet, which already has some weird quirks for eSet backwards compatibility. (e.g. sampleNames(x) works, but sampleNames(x)<- does not work; pData(x) calls colData(x); fData(x) calls rowData(x)) There are little niggles that I should probably just send in a patch for, but a cleaner overall container would be better, if for no other reason than the aforementioned ability to easily experiment with imputation. An approach that I've been using is to stuff the SNPs, CNV (as GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE). This is... somewhat less than optimal, especially when subsetting. But it does suggest that I could define a coercion from the current rambling wreck into a nice clean new class/API (ExperimentCollection or whatever) and I'll bet other package authors could, too. The presence of a GRangesFrame would then be handy for returning a given assay's results, so that the user could be blissfully ignorant of the storage backing (ff, BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management advantages of a SummarizedExperiment. JMHO Statistics is the grammar of science.
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
On 03/04/2015 10:03 AM, Peter Haverty wrote:

Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned.

The current version, under R-devel, is at

    devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

    methods(class="SummarizedExperiment")
     [1] [                 [[                [[<-              [<-
     [5] $                 $<-               assay             assay<-
     [9] assayNames        assayNames<-      assays            assays<-
    [13] cbind             coerce            colData           colData<-
    [17] compare           Compare           countOverlaps     coverage
    [21] dim               dimnames          dimnames<-        disjointBins
    [25] distance          distanceToNearest duplicated        elementMetadata
    [29] elementMetadata<- end               end<-             exptData
    [33] exptData<-        extractROWS       findOverlaps      flank
    [37] follow            granges           isDisjoint        mcols
    [41] mcols<-           narrow            nearest           order
    [45] overlapsAny       precede           ranges            ranges<-
    [49] rank              rbind             replaceROWS       resize
    [53] restrict          rowData           rowData<-         seqinfo
    [57] seqinfo<-         seqnames          shift             show
    [61] sort              split             start             start<-
    [65] strand            strand<-          subset            subsetByOverlaps
    [69] updateObject      values            values<-          width
    [73] width<-
    see ?methods for accessing help and source code

and

    head(attr(methods(class="SummarizedExperiment"), "info"))
                                                                 generic visible
    [,SummarizedExperiment,ANY-method                            [       TRUE
    [[,SummarizedExperiment,ANY,missing-method                   [[      TRUE
    [[<-,SummarizedExperiment,ANY,missing-method                 [[<-    TRUE
    [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method [<-     TRUE
    $,SummarizedExperiment-method                                $       TRUE
    $<-,SummarizedExperiment-method                              $<-     TRUE
                                                                 isS4 from
    [,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
    [[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
    [[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
    [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
    $,SummarizedExperiment-method                                TRUE GenomicRanges
    $<-,SummarizedExperiment-method                              TRUE GenomicRanges

Martin

Pete

Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence lawrence.mich...@gene.com wrote:

I think we need to make sure that there are enough benefits of something like GRangesFrame before we introduce yet another complicated and overlapping data structure into the framework. Prior to summarization, the ranges seem primary, after summarization, it may often make sense for them to be secondary. But I'm just not sure what we gain from a new data structure.
[Bioc-devel] New(ish!) Seattle Bioconductor team member
Let me take this belated opportunity to introduce Jim Hester jhes...@fredhutch.org to the Bioconductor developer community.

Jim is working in the short term on SummarizedExperiment, including the refactoring efforts he introduced yesterday as well as coercion methods to and from ExpressionSet (an initial version from ExpressionSet to SummarizedExperiment is available in the development version of GenomicRanges; iterations will include coercion in the reverse direction as well as perhaps more 'clever' mapping between the probeset or gene names of ExpressionSet and relevant range-based notation). Jim will also contribute to ongoing project activities like new package reviews, package maintenance, and upcoming release activities.

Jim brings a lot of interesting biological and software development experience to the project. Say hi when you have a chance!

Martin

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
[Bioc-devel] Advice (was Re: CRAN package with Bioconductor dependencies)
Hi,

I'm following this discussion with interest, for the following reason. There are more than a dozen packages that I have written and still maintain. Most of them were started while I was at M.D. Anderson. They were served from a highly non-mainstream repository hosted there, with the code managed in a local Subversion repository behind their firewall. Since moving to Ohio State, I transferred the code to R-Forge. (If you want to figure out what the packages are and do, search for OOMPA.) So, it's still in a non-mainstream repository, but it's (to continue the metaphor) at least on a bigger tributary than it used to be.

Many of the packages are written to be compatible with some of the core BioConductor classes, which means that they import Biobase. But all of the functionality is available without using BioConductor (provided the user is willing to assemble the data into the correct set of matrices).

I've been thinking about submitting it to either CRAN or BioConductor. Which makes more sense?

Best,
Kevin

On 3/4/2015 4:27 PM, Laurent Gatto wrote:

On 3 March 2015 06:07, Henrik Bengtsson wrote:

Not that long ago DESCRIPTION field 'Additional_repositories' was introduced with the purpose of providing references to non-mainstream package repositories, e.g. R-Forge. Interestingly, by mainstream they mean CRAN and Bioconductor. The 'Additional_repositories' field is also enforced for CRAN packages depending on non-mainstream packages, where "depending on" can be any package under Depends, Imports, Suggests and (I guess) LinkingTo and Enhances.

Thanks, Henrik! If I understand well, Bioconductor is considered a mainstream repository and so is not expected to be added as an Additional_repository (despite the fact that install.packages does not install from the Bioc repository by default). The issues with doing so nevertheless would be that CRAN maintainers might complain and this would break the tied R/Bioc versions.

Best wishes,
Laurent

I bet that in a, hopefully, not too far future, we'll find that install.packages() will install from not only CRAN by default, but also Bioconductor and whatever Additional_repositories suggests. As usual, the bet is about food and drinks in person wherever/whenever feasible.

BTW, I have a few feature requests related to Bioc releases/versions:

1. Add release date to online announcement pages, e.g. http://bioconductor.org/news/bioc_2_14_release/

2. A data.frame listing Bioc versions and their release dates (maybe even time stamps), e.g.

    biocVersions()
    1.0   2002-04-29
    ...
    2.14  2014-04-14
    3.0   2014-10-14
    3.1   2015-04-17

3. As far as I understand it, the recommended Bioc version to use depends on R version and the date (in the past only R version). I would like to have a function that returns the Bioc version as a function of R version and date. Maybe BiocInstaller::biocVersion() could be extended with this feature, e.g.

    biocVersion <- function(date, rversion) {
      ## Current?
      if (missing(date) && missing(rversion)) return(BIOC_VERSION)

      if (missing(date)) date <- Sys.Date()
      date <- as.Date(date)
      if (missing(rversion)) rversion <- getRversion()

      ## Lookup by (rversion, date) from known releases
      ## and make best guesses for the future (with a warning)
      ...
    }

If such a function could be available as a light-weight script online, then the proper Bioc repos could be downloaded by tools:::.BioC_version_associated_with_R_version(), cf. Martin's reply on lagging Bioc versions. This would bring us one step closer to installing Bioc packages using install.packages(), cf. Laurent's original post. Because it may not be clear to an R user that they need to go to Bioconductor because a CRAN package depends on a Bioc package. That user might not even have heard of Bioconductor. Not suggesting biocLite() should be replaced, but the gap for using install.packages() could be made smaller.

... and maybe one day we'll have an omnibus package installer/updater available in a fresh R installation.

The above biocVersion() function would also be useful for figuring out what R/BioC version was in use at a certain year in the past (e.g. reproducing old work) and for finding out versions of Bioc release/devel packages back in time (e.g. if you try to be backward compatible).

Thxs,
Henrik

On Mon, Mar 2, 2015 at 3:41 PM, Laurent Gatto lg...@cam.ac.uk wrote:

Thank you all for your answers.

Laurent

On 2 March 2015 23:27, Martin Morgan wrote:

On 03/02/2015 03:18 PM, Laurent Gatto wrote:

Dear all,

I had never realised that CRAN packages that depended on Bioc packages could actually not be installed with install.packages without setting a repo or using BiocInstaller::biocLite. Here is an example using a fresh R installation

http://cran.r-project.org/web/packages/MSeasy/index.html
Depends: amap, clValid, cluster, fpc, mzR, xcms

    $ docker run --rm -ti rocker/r-base R

    R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
    Copyright (C)
Re: [Bioc-devel] Advice (was Re: CRAN package with Bioconductor dependencies)
On Wed, Mar 4, 2015 at 5:21 PM, Kevin Coombes kevin.r.coom...@gmail.com wrote:

I've been thinking about submitting it to either CRAN or BioConductor. Which makes more sense?

Sometime we should write up comments to help with decision-making in this domain. In my view the main difference at this time is the simultaneous management of release and devel streams in Bioc. This leads to a bit of additional complexity for the developer but it permits aggressive experimentation in the devel branch that will add mileage from those using the bleeding edge, while not affecting users of the release branch.

There may be some differences in the continuous integration interface and the task view discovery support. There are probably other differences that others should chime in on.
Re: [Bioc-devel] New(ish!) Seattle Bioconductor team member
On Wed, Mar 4, 2015 at 2:29 PM, Michael Lawrence lawrence.mich...@gene.com wrote:

Welcome. For those who don't know, Jim is also the author of the neat lintr package, which checks your R code as you type, across multiple editors.

https://github.com/jimhester/lintr

Not to mention https://github.com/jimhester/covr - It only took me one round of 'covr' to become a test-coverage-oholic.

Jim, great to have you on board.

/Henrik

Michael
Re: [Bioc-devel] CRAN package with Bioconductor dependencies
Hi Gordon,

On 03/04/2015 02:12 PM, Gordon Brown wrote:

Hi,

Hi, Bioc-devel folks,

Is there an accepted way to migrate a package from CRAN to BioC? CRAN's policy says, "The package's license must give the right for CRAN to distribute the package in perpetuity". Is it enough to request that CRAN archive the package, then submit it as a new package to BioC? I own a CRAN package (msarc) that probably should have been in BioC from the start, but isn't. Suggestions?

I can't speak to the CRAN policy side of things, but several packages have migrated from CRAN to Bioconductor. At a minimum, you need to make sure that the version in Bioconductor is higher than the version in CRAN, but you should also definitely ask CRAN to archive the package.

The way Bioconductor works, your package will first be available only in the devel version, but will then become available in the release version after our next release. Currently we're scheduled to release 3.1 on April 17 and the deadline for package submissions for this release is March 27.

Dan

Thanks, Dan. I'll check with the CRAN folks to see what they say, and pass along anything significant regarding policy etc. It's not a crisis if there's a lag between getting it archived at CRAN and into the release version of Bioconductor.

The CRAN folks will ask you to proceed the other way around. First submit your package to Bioconductor. Only after we release (in April), ask them to remove and archive your package. It makes sense since this ensures continuous availability of the package.

Cheers,
H.

Cheers,
- Gord

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
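
The version requirement mentioned in this thread (the Bioconductor version must be higher than the CRAN one) can be checked with base R before submitting. This is only a sketch: the version strings below are made up for illustration and are not the actual versions of msarc.

```r
# Hypothetical version strings, for illustration only.
cran_version <- "1.4.2"
bioc_version <- "1.5.0"

# package_version() compares numerically, component by component,
# so "1.10.0" counts as newer than "1.9.0".
bioc_is_newer <- package_version(bioc_version) > package_version(cran_version)
print(bioc_is_newer)

# utils::compareVersion() returns -1, 0 or 1 for older, equal or newer.
cmp <- utils::compareVersion(bioc_version, cran_version)
print(cmp)
```

Comparing through package_version() rather than as plain strings avoids the usual trap where "1.10.0" sorts before "1.9.0" lexically.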
Re: [Bioc-devel] CRAN package with Bioconductor dependencies
Hi, Bioc-devel folks,

Is there an accepted way to migrate a package from CRAN to BioC? CRAN's policy says, "The package's license must give the right for CRAN to distribute the package in perpetuity". Is it enough to request that CRAN archive the package, then submit it as a new package to BioC? I own a CRAN package (msarc) that probably should have been in BioC from the start, but isn't. Suggestions?

Thanks in advance,

- Gord
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
I am a bit concerned about any major alterations to the SummarizedExperiment API. We have two papers and plenty of working code that use it in meaningful ways. Effort required to keep new formulations back-compatible as well as bug-free has to be weighed seriously. I agree that the name is not ideal. We are learning as we go. Seems to make sense to start with the contracts we want the instances of a class to satisfy. I have long felt that X[i, j] idiom is one users and developers should be comfortable with, even insist on, and for consistency with matrix operations idiom, it should work in a natural way for numeric indexing. This seems like an important constraint. subsetBy* is a useful idiom, but it is conceivable that we would adopt filter() for row-oriented selections and select() for column-oriented selections. Do we have to make any special design considerations to allow very smooth interoperation with out-of-memory resources for certain components for developers who want to allow this? We should have a reasonable way to get data on what is out there, what is used, how it is most effectively used. What's the SE API? Is it well-adapted to requirements of DESeq2? Other killer packages that use/don't use it? Even getting data on the formal API for a class is not all that familiar. And if folks are writing non-S4 interfaces (i.e., naked functions) we have no way of identifying them. See below for one way of discovering the API for SummarizedExperiment. In summary, I think we have to be careful about overdesigning too early. Getting clear on contracts seems the best way to ensure reuse, and we really want that so that reliability is continually assessed. My sense is that it is good to give developers something they'll gladly extend, not necessarily reuse directly. So we don't have to have broad consensus on class details, but on the minimal abstraction and on obligatory tests on its basic implementation. 
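
One way to make the X[i, j] contract described above concrete is a small check that any rectangular object would have to pass. The sketch below is illustrative only: check_subset_contract is a hypothetical helper name, not part of any Bioconductor package, and it is exercised on a base matrix rather than on a SummarizedExperiment. It assumes positive numeric indices.

```r
# Hypothetical helper: checks that numeric X[i, j] behaves like matrix
# subsetting with respect to dimensions and dimnames.
# Assumes i and j are positive indices within bounds.
check_subset_contract <- function(x, i, j) {
  y <- x[i, j, drop = FALSE]
  stopifnot(
    identical(dim(y), c(length(i), length(j))),
    identical(rownames(y), rownames(x)[i]),
    identical(colnames(y), colnames(x)[j])
  )
  invisible(TRUE)
}

# A 4 x 3 matrix standing in for "features x samples".
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("r", 1:4), paste0("s", 1:3)))

ok <- check_subset_contract(m, i = c(1, 3), j = 2:3)
print(ok)
```

A check of this kind could serve as one of the "obligatory tests" on a minimal abstraction, since it only relies on dim(), dimnames and 2D `[`.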
    methods(class="SummarizedExperiment") # perhaps an obsolete version of methods cataloguer by MTM

    DataFrame with 76 rows and 3 columns
        generic       signature                                                         package
        <character>   <character>                                                       <character>
    1   [             x=SummarizedExperiment, i=ANY, j=ANY, drop=ANY                    base
    2   [[<-          x=SummarizedExperiment, i=ANY, j=missing, value=ANY               base
    3   [[            x=SummarizedExperiment, i=ANY, j=missing                          base
    4   [<-           x=SummarizedExperiment, i=ANY, j=ANY, value=SummarizedExperiment  base
    5   assay         x=SummarizedExperiment, i=character                               GenomicRanges
    ... ...           ...                                                               ...
    72  updateObject  object=SummarizedExperiment                                       BiocGenerics
    73  values        x=SummarizedExperiment                                            S4Vectors
    74  values<-      x=SummarizedExperiment                                            S4Vectors
    75  width         x=SummarizedExperiment                                            BiocGenerics
    76  width<-       x=SummarizedExperiment                                            BiocGenerics
Re: [Bioc-devel] Changes to the SummarizedExperiment Class
May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can return whatever makes sense (GRanges, or other data structures -thinking taxonomy for metagenomics for example-). GRangesFrame can inherit from this.

On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès hpa...@fredhutch.org wrote:

GRangesFrame is an interesting idea and I gave it some thoughts. There is this nice symmetry between GRanges and GRangesFrame:

- GRanges = a naked GRanges + a DataFrame accessible via mcols()
- GRangesFrame = a DataFrame + a naked GRanges accessible via some accessor (e.g. rowRanges())

So GRanges and GRangesFrame are equivalent in terms of what they can hold, but different in terms of API: the former has the ranges API as primary API and the DataFrame API on its mcols() component, and the latter has the DataFrame API as primary API and the ranges API on its rowRanges() component. Nice switch!

What does this API switch bring us? A GRangesFrame object is now an object that fully behaves like a DataFrame and people can also perform range-based operations on its rowRanges() component.

Here is what I'm afraid is going to happen: people will also want to be able to perform range-based operations *directly* on these objects, i.e. without having to call rowRanges() first. So for example when they do subsetByOverlaps(), subsetting happens vertically. Also the Hits object returned by findOverlaps() would contain row indices.

Problem with this is that these objects now start to suffer from the "dual personality" syndrome. For example, it's not clear anymore what their length should be. Strictly speaking it should be their number of columns (that's what the length of a DataFrame is), but the ranges API that we're trying to put on them also makes them feel like vectors along the vertical dimension so it also feels that their length should be their number of rows. Same thing with 1D subsetting. Why does it subset the columns and not the rows? Most people are now confused.
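
The length and 1D-subsetting ambiguity described above can be seen with a plain base R data.frame. This is a minimal sketch under an assumption: the data.frame is only a dependency-free stand-in for a DataFrame-like object, and the column names are invented for illustration.

```r
# Stand-in for a DataFrame-like object: 3 rows (think ranges), 2 columns.
# Using base data.frame is an assumption made to keep the sketch self-contained.
df <- data.frame(score = c(0.1, 0.5, 0.9), gc = c(0.40, 0.55, 0.61))

# Data-frame personality: length() is the number of columns, not rows.
n_len  <- length(df)
n_rows <- nrow(df)

# 1D subsetting selects columns; selecting rows needs the 2D form.
first_1d <- df[1]     # one-column data.frame (all 3 rows)
first_2d <- df[1, ]   # one-row data.frame (both columns)

print(c(length = n_len, nrow = n_rows))
print(dim(first_1d))
print(dim(first_2d))
```

An object that also answered to a vector-like ranges API would be expected to report 3 for its length and to return the first row for x[1], which is the conflict the message points out.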
It's interesting to note that the same thing happens with GRanges objects, but in the opposite direction: people wish they could do DataFrame operations directly on them without calling mcols() first. But in order to preserve the good health of GRanges objects, we've not done that (except for $, a shortcut for mcols(x)$, the pressure was just too strong).

H.

On 03/03/2015 04:35 PM, Michael Lawrence wrote:

Should be possible for the annotations to be of any type, as long as they satisfy a simple contract of NROW() and 2D [. Then, you could have a DataFrame, GRanges, or whatever in there. But it would be nice to have a special class for the container with range information. The contract for the range annotation would be to have a granges() method.

I agree it would be nice if there was a way with the methods package to easily assert such contracts. For example, one could define an interface with a set of generics (and optionally the relevant position in the generic signature). Then, once all of the methods have been assigned for a particular class, it is made to inherit from that contract class. There are lots of gotchas though. Not sure how useful it would be in practice.

On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty haverty.pe...@gene.com wrote:

There are some nice similarities in these new imaginary types. A GRangesFrame is a list of dimensionally identical things (columns) and some row meta-data (the GRanges). The SE-like object is similarly a list of dimensionally like things (matrices, RleDataFrames, BigMatrix objects, HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame). Elegant? Maybe they would actually be relatives in the class tree.

I wonder if this kind of thing would be easier if we had Java-style Interfaces or duck-typing. "The x slot of y holds something that implements this set of methods ..."

Oh, and kinda apropos, the genoset class will probably go away or become an extension to this new SE-like thing. The extra stuff that comes along with genoset will still be available.

Pete

Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. tim.tri...@gmail.com wrote:

This. It would be damned near perfect as a return value for assays coming out of an object that held several such assays at several time points in a population, where there are both assay-wise and covariate-wise holes that could nonetheless be usefully imputed across assays.

Statistics is the grammar of science.
Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty haverty.pe...@gene.com wrote:

I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument.

Just impossible. As Michael mentioned back in November, they have conflicting