Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
Oh, I don't disagree.  Perhaps the two problems can be addressed
simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to
be useful
2) calling it something besides SummarizedExperiment, say,
ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful)
and progress could be sought in the offshoot (ExperimentCollection or
whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA
methylation & CNV data (which can of course be called from the same assay)
is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
eSet backwards compatibility.  (e.g. sampleNames(x) works, but
sampleNames(x) <- does not work; pData(x) calls colData(x); fData(x) calls
rowData(x))  There are little niggles that I should probably just send in a
patch for, but a cleaner overall container would be better, if for no other
reason than the aforementioned ability to easily experiment with
imputation. An approach that I've been using is to stuff the SNPs, CNV (as
GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
somewhat less than optimal, especially when subsetting.
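
To make the subsetting problem concrete, here is a minimal base-R sketch. The list layout and the subset_samples() helper are hypothetical stand-ins, not the SummarizedExperiment API; the only point is that data parked in a free-form metadata slot is not touched when the assay is subset by sample.

```r
# Toy stand-in for an SE-like object (hypothetical layout, base R only).
samples <- paste0("s", 1:4)
se <- list(
  assay    = matrix(1:12, nrow = 3,
                    dimnames = list(paste0("probe", 1:3), samples)),
  colData  = data.frame(group = c("a", "a", "b", "b"), row.names = samples),
  # Extra per-sample data stuffed into a free-form metadata list
  exptData = list(mrna = matrix(1:8, nrow = 2,
                                dimnames = list(c("g1", "g2"), samples)))
)

# Subsetting that only knows about the assay and the column annotations.
subset_samples <- function(x, j) {
  x$assay   <- x$assay[, j, drop = FALSE]
  x$colData <- x$colData[j, , drop = FALSE]
  x  # exptData is carried along untouched
}

sub <- subset_samples(se, c("s1", "s3"))
ncol(sub$assay)          # two samples remain in the assay
ncol(sub$exptData$mrna)  # the stuffed matrix still has all four samples
```

After the call the assay and the stuffed matrix disagree about which samples are present, which is the kind of bookkeeping a dedicated multi-assay container would be expected to take over.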

But it does suggest that I could define a coercion from the current
rambling wreck into a nice clean new class/API (ExperimentCollection or
whatever) and I'll bet other package authors could, too.  The presence of a
GRangesFrame would then be handy for returning a given assay's results, so
that the user could be blissfully ignorant of the storage backing (ff,
BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
advantages of a SummarizedExperiment.

JMHO

Statistics is the grammar of science.
Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey st...@channing.harvard.edu
wrote:

  I am a bit concerned about any major alterations to the
 SummarizedExperiment API.  We have two papers and plenty of working code
 that use it in meaningful ways.  Effort required to keep new formulations
 back-compatible as well as bug-free has to be weighed seriously.

  I agree that the name is not ideal.  We are learning as we go.

  It seems to make sense to start with the contracts we want the instances
 of a class to satisfy.  I have long felt that the X[i, j] idiom is one users
 and developers should be comfortable with, even insist on, and for
 consistency with the matrix idiom it should work in a natural way for
 numeric indexing.  This seems like an important constraint.  subsetBy* is a
 useful idiom, but it is conceivable that we would adopt filter() for
 row-oriented selections and select() for column-oriented selections.  Do we
 have to make any special design considerations to allow very smooth
 interoperation with out-of-memory resources for certain components, for
 developers who want to allow this?
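
As a rough illustration of the X[i, j] contract described above, the following sketch defines a toy S4 container whose "[" method keeps the assay and its row and column annotations in step under numeric indexing. The class name ToyExperiment and its slots are made up for this example; this is not how SummarizedExperiment is implemented.

```r
library(methods)

# Hypothetical minimal container: one assay matrix plus row/column annotations.
setClass("ToyExperiment",
  representation(assay = "matrix", rowInfo = "data.frame", colInfo = "data.frame"),
  validity = function(object) {
    if (nrow(object@assay) != nrow(object@rowInfo))
      return("rowInfo needs one row per assay row")
    if (ncol(object@assay) != nrow(object@colInfo))
      return("colInfo needs one row per assay column")
    TRUE
  })

# Two-dimensional subsetting: rows and columns of every component move together.
setMethod("[", "ToyExperiment", function(x, i, j, ..., drop = TRUE) {
  if (missing(i)) i <- seq_len(nrow(x@assay))
  if (missing(j)) j <- seq_len(ncol(x@assay))
  new("ToyExperiment",
      assay   = x@assay[i, j, drop = FALSE],
      rowInfo = x@rowInfo[i, , drop = FALSE],
      colInfo = x@colInfo[j, , drop = FALSE])
})

setMethod("dim", "ToyExperiment", function(x) dim(x@assay))

x <- new("ToyExperiment",
         assay   = matrix(1:6, nrow = 3),
         rowInfo = data.frame(id = c("r1", "r2", "r3")),
         colInfo = data.frame(sample = c("s1", "s2")))
sub <- x[2:3, 1]   # rows 2-3, column 1, annotations follow along
dim(sub)
```

Because new() re-runs the validity check on every subset, a "[" method that let the pieces drift apart would fail immediately. Row-oriented or column-oriented verbs such as filter() and select() could then be thin wrappers that compute i or j and delegate to "[".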

  We should have a reasonable way to get data on what is out there, what
 is used, and how it is most effectively used.  What's the SE API?  Is it
 well-adapted to the requirements of DESeq2?  Are there other killer packages
 that use or don't use it?  Even getting data on the formal API for a class
 is not a familiar exercise.  And if folks are writing non-S4 interfaces
 (i.e., naked functions) we have no way of identifying them.  See below for
 one way of discovering the API for SummarizedExperiment.

  In summary, I think we have to be careful about overdesigning too
 early.  Getting clear on contracts seems the best way to ensure reuse, and
 we really want that so that reliability is continually assessed.  My sense
 is that it is good to give developers something they'll gladly extend, not
 necessarily reuse directly.  So we don't need broad consensus on class
 details, only on the minimal abstraction and on obligatory tests of its
 basic implementation.

  methods(class=SummarizedExperiment)  # perhaps an obsolete version of methods cataloguer by MTM

 DataFrame with 76 rows and 3 columns
      generic       signature                                                         package
      <character>   <character>                                                       <character>
 1    [             x=SummarizedExperiment, i=ANY, j=ANY, drop=ANY                    base
 2    [             x=SummarizedExperiment, i=ANY, j=missing, value=ANY               base
 3    [             x=SummarizedExperiment, i=ANY, j=missing                          base
 4    [<-           x=SummarizedExperiment, i=ANY, j=ANY, value=SummarizedExperiment  base
 5    assay         x=SummarizedExperiment, i=character                               GenomicRanges
 ...  ...           ...                                                               ...
 72   updateObject  object=SummarizedExperiment                                       BiocGenerics
 73   values        x=SummarizedExperiment                                            S4Vectors
 74   values<-      x=SummarizedExperiment                                            S4Vectors
 75   width         x=SummarizedExperiment                                            BiocGenerics
 76   width<-       x=SummarizedExperiment                                            BiocGenerics
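
For readers without the cataloguer used above, a much cruder check can be sketched with the methods package alone. The class ToyRanges, the generics toyWidth and toyEnd, and the helper implemented_generics() are all invented for this example; existsMethod() only reports methods defined directly for the class, not inherited ones.

```r
library(methods)

setClass("ToyRanges", representation(start = "integer", width = "integer"))

# Two hypothetical generics; only one of them gets a ToyRanges method.
setGeneric("toyWidth", function(x) standardGeneric("toyWidth"))
setGeneric("toyEnd",   function(x) standardGeneric("toyEnd"))
setMethod("toyWidth", "ToyRanges", function(x) x@width)
setMethod("show", "ToyRanges", function(object) {
  cat("ToyRanges of length", length(object@start), "\n")
})

# Which of the candidate S4 generics have a method defined directly for `class`?
implemented_generics <- function(class, generics) {
  Filter(function(g) existsMethod(g, class), generics)
}

api <- implemented_generics("ToyRanges", c("toyWidth", "toyEnd", "show"))
api
# showMethods(classes = "ToyRanges") prints a fuller listing interactively.
```

This only sees S4 methods on generics you thought to ask about, so it shares the blind spot mentioned above: plain functions that happen to accept the class remain invisible.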

 On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo 

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Michael Lawrence
I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary; after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.

On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès hpa...@fredhutch.org wrote:

 GRangesFrame is an interesting idea and I gave it some thought.

 There is this nice symmetry between GRanges and GRangesFrame:

 - GRanges = a naked GRanges + a DataFrame accessible via mcols()

 - GRangesFrame = a DataFrame + a naked GRanges accessible via
  some accessor (e.g. rowRanges())

 So GRanges and GRangesFrame are equivalent in terms of what they
 can hold, but different in terms of API: the former has the ranges
 API as primary API and the DataFrame API on its mcols() component,
 and the latter has the DataFrame API as primary API and the ranges
 API on its rowRanges() component. Nice switch!

 What does this API switch bring us? A GRangesFrame object is now
 an object that fully behaves like a DataFrame and people can also
 perform range-based operations on its rowRanges() component.
 Here is what I'm afraid is going to happen: people will also want
 to be able to perform range-based operations *directly* on
 these objects, i.e. without having to call rowRanges() first.
 So for example when they do subsetByOverlaps(), subsetting
 happens vertically. Also the Hits object returned by findOverlaps()
 would contain row indices. Problem with this is that these objects
 now start to suffer from the dual personality syndrome. For
 example, it's not clear anymore what their length should be.
 Strictly speaking it should be their number of columns (that's
 what the length of a DataFrame is), but the ranges API that
 we're trying to put on them also makes them feel like vectors
 along the vertical dimension so it also feels that their length
 should be their number of rows. Same thing with 1D subsetting.
 Why does it subset the columns and not the rows? Most people
 are now confused.
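
The length and 1D-subsetting behaviour described above can be seen with a plain base-R data.frame, used here only as a stand-in for DataFrame; the column names are arbitrary.

```r
# A data.frame is a list of columns, so its "vector" dimension is horizontal.
df <- data.frame(score = c(0.1, 0.5, 0.9), strand = c("+", "-", "+"))

length(df)        # number of columns, not rows
nrow(df)          # the vertical dimension needs its own accessor
names(df[1])      # 1D subsetting picks a column
dim(df[1:2, ])    # selecting rows requires the 2D form

# A plain vector, by contrast, is indexed along its only dimension.
v <- df$score
length(v[1:2])
```

An object that answers length() like a frame but findOverlaps() like a vector of ranges has to pick one of these two readings for every 1D operation, which is the ambiguity being pointed out.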

 It's interesting to note that the same thing happens with GRanges
 objects, but in the opposite direction: people wish they could
 do DataFrame operations directly on them without calling mcols()
 first. But in order to preserve the good health of GRanges objects,
 we've not done that (except for $, a shortcut for mcols(x)$,
 the pressure was just too strong).

 H.



 On 03/03/2015 04:35 PM, Michael Lawrence wrote:

 Should be possible for the annotations to be of any type, as long as they
 satisfy a simple contract of NROW() and 2D [. Then, you could have a
 DataFrame, GRanges, or whatever in there. But it would be nice to have a
 special class for the container with range information. The contract for
 the range annotation would be to have a granges() method.

 I agree it would be nice if there was a way with the methods package to
 easily assert such contracts. For example, one could define an interface
 with a set of generics (and optionally the relevant position in the generic
 signature). Then, once all of the methods have been assigned for a
 particular class, it is made to inherit from that contract class. There are
 lots of gotchas though. Not sure how useful it would be in practice.
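
One way to picture the NROW() plus 2D "[" contract is a behavioural check that simply tries both operations. The function name satisfies_annotation_contract() is hypothetical, and this runtime probe is a simpler substitute for the interface-of-generics mechanism described above, not an implementation of it.

```r
# Returns TRUE if `x` reports a row count and supports x[i, , drop = FALSE].
satisfies_annotation_contract <- function(x) {
  n <- tryCatch(NROW(x), error = function(e) NA_integer_)
  if (is.na(n)) return(FALSE)
  if (n == 0L) return(TRUE)  # nothing to subset, accept by convention
  first <- tryCatch(x[1L, , drop = FALSE], error = function(e) NULL)
  !is.null(first) && NROW(first) == 1L
}

satisfies_annotation_contract(data.frame(a = 1:3))    # rectangular, passes
satisfies_annotation_contract(matrix(1:6, nrow = 3))  # rectangular, passes
satisfies_annotation_contract(1:3)                    # no 2D "[", fails
```

A probe like this catches objects that lack the operations, but it cannot say whether the operations mean the right thing, which may be one of the gotchas alluded to.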


 On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty haverty.pe...@gene.com
 wrote:

  There are some nice similarities in these new imaginary types.  A
 GRangesFrame is a list of dimensionally identical things (columns) and
 some row meta-data (the GRanges).  The SE-like object is similarly a list
 of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
 HDF5-backed things) with some row meta-data (a DataFrame or
 GRangesFrame).
 Elegant?  Maybe they would actually be relatives in the class tree.

 I wonder if this kind of thing would be easier if we had Java-style
 Interfaces or duck-typing.  The x slot of y holds something that
 implements this set of methods ...
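
The closest thing the methods package offers to "the x slot of y holds something that implements this interface" is probably a class union used as a slot type. The sketch below uses made-up names (RowAnnotation, Holder) and enrols base classes by name; membership is declared, not inferred from which methods exist, so it is nominal typing and not true duck typing.

```r
library(methods)

# A virtual "interface" class: anything enrolled here may sit in the slot.
setClassUnion("RowAnnotation", c("data.frame", "matrix"))

# Hypothetical container whose slot accepts any member of the union.
setClass("Holder", representation(rowInfo = "RowAnnotation"))

ok  <- new("Holder", rowInfo = data.frame(id = 1:2))
ok2 <- new("Holder", rowInfo = matrix(1:4, nrow = 2))

# An integer vector is not a member of the union, so construction fails.
bad <- tryCatch(new("Holder", rowInfo = 1:3),
                error = function(e) "rejected")
bad
```

New classes can be added to the union later with setIs() or by listing them at definition time, which is roughly the "made to inherit from that contract class" step, minus any automatic check that the required methods are actually present.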

 Oh, and kinda apropos, the genoset class will probably go away or become
 an extension to this new SE-like thing.  The extra stuff that comes along
 with genoset will still be available.

 Pete

 
 Peter M. Haverty, Ph.D.
 Genentech, Inc.
 phave...@gene.com

 On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. tim.tri...@gmail.com
 wrote:

  This.

 It would be damned near perfect as a return value for assays coming out of
 an object that held several such assays at several time points in a
 population, where there are both assay-wise and covariate-wise holes that
 could nonetheless be usefully imputed across assays.


 On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty haverty.pe...@gene.com
 wrote:




   I still think GRanges should be a subclass of DataFrame,

 which would make 

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
What complexity?  The Nature Methods paper laid it out: for most people,
most of the time, use an SE.

That way, the organization of metadata and covariates is enforced for you,
like an ExpressionSet (another winning data structure) but without its
baggage.

Maybe the Summarized in the name isn't such a bad idea after all.
 AfterTheDataMungingIsDone doesn't have the same ring to it.

What would be equally awesome IMHO is to have a similarly unifying
structure for integrative work.

But that's just, like, my opinion.  I've taken a whack at it when I knew
even less than I do now, and it's hard.  However, data management for
expression arrays was hard, too.  If I'm not mistaken, there were benefits
to solving that data management problem, too.  Some sort of a software
project.  I think it was called MADMAN.  I'll have to go look.  ;-)


On Wed, Mar 4, 2015 at 10:03 AM, Peter Haverty haverty.pe...@gene.com
wrote:

  Michael has a good point. The complexity of the BioC universe of classes
 hurts our ability to attract new users. More classes would be a minus there
 ... but a small set of common, explicit APIs would simplify things.
 Rectangular things implement the matrix Interface.  :-) Deprecating old
 stuff, like eSet, might help more than it hurts, on the simplicity front.

  P.S. apropos of understanding this universe of classes, I *love* the
 methods(class=x) thing Vincent mentioned.

  Pete


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Vincent Carey
On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo robert.cast...@upf.edu
wrote:

 some of the goals behind this discussion are IMO similar to the ones for
 biocMultiAssay:

 https://github.com/vjcitn/biocMultiAssay

 maybe Vince can confirm.



It is true that there are connections between the concerns.  But the way I
see it, the container design we are talking about in this thread addresses
the management of a fixed common assay type over a fixed set of samples.

The biocMultiAssay deals with the management of multiple assay types over
multiple samples, with possible disparities in sample sets over the
different assay types.
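
The distinction can be sketched in base R with two assays measured on overlapping but unequal sample sets. The assay names, sample IDs and values below are invented, and nothing here reflects how biocMultiAssay itself stores data.

```r
# Two assay types with different features and different sample sets.
assays <- list(
  expression  = matrix(1:6, nrow = 2,
                       dimnames = list(c("g1", "g2"), c("s1", "s2", "s3"))),
  methylation = matrix(1:8 / 10, nrow = 4,
                       dimnames = list(paste0("cg", 1:4), c("s2", "s3")))
)

samples_per_assay <- lapply(assays, colnames)
all_samples <- Reduce(union, samples_per_assay)
common      <- Reduce(intersect, samples_per_assay)

# Sample-by-assay coverage: FALSE marks a "hole".
coverage <- sapply(samples_per_assay, function(s) all_samples %in% s)
rownames(coverage) <- all_samples
coverage

# Restricting every assay to the shared samples recovers the fixed-sample case.
complete <- lapply(assays, function(m) m[, common, drop = FALSE])
```

Once every assay is cut down to the common samples the problem looks like the single-container case again; the harder design question is what to do with the samples that the coverage table marks as missing.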



 robert.


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Peter Haverty
Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence lawrence.mich...@gene.com
wrote:

 I think we need to make sure that there are enough benefits of something
 like GRangesFrame before we introduce yet another complicated and
 overlapping data structure into the framework. Prior to summarization, the
 ranges seem primary; after summarization, it may often make sense for them
 to be secondary. But I'm just not sure what we gain from a new data
 structure.


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Tim Triche, Jr.
My response was meant to address this:

1) fixed-dimension, fixed sample set is a solved problem, and SE is that
solution.
2) multi-assay, holes across samples remains an ugly thorny problem,
maybe needs a new API

So why not keep SE as stable as possible, and dump all the explosive
changes into the latter?


On Wed, Mar 4, 2015 at 9:12 AM, Vincent Carey st...@channing.harvard.edu
wrote:



 On Wed, Mar 4, 2015 at 12:01 PM, Robert Castelo robert.cast...@upf.edu
 wrote:

 some of the goals behind this discussion are IMO similar to the ones for
 biocMultiAssay:

 https://github.com/vjcitn/biocMultiAssay

 maybe Vince can confirm.



 It is true that there are connections between the concerns.  But the way I
 see it, the container design we are talking about in this thread addresses
 the management of a fixed common assay type over a fixed set of samples.

 The biocMultiAssay deals with the management of multiple assay types over
 multiple samples, with possible disparities in sample sets over the
 different assay types.



 robert.


Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Robert Castelo
some of the goals behind this discussion are IMO similar to the ones for 
biocMultiAssay:


https://github.com/vjcitn/biocMultiAssay

maybe Vince can confirm.

robert.

On 03/04/2015 05:16 PM, Tim Triche, Jr. wrote:

Oh, I don't disagree.  Perhaps the two problems can be addressed
simultaneously by

1) deciding on what contracts a multi-assay container can/would demand to
be useful
2) calling it something besides SummarizedExperiment, say,
ExperimentCollection

Then the SE API could stay the same as it is (which is already very useful)
and progress could be sought in the offshoot (ExperimentCollection or
whatever) without breaking things that rely on SE.

Just off the top of my head, a most generically useful container for DNA
methylation & CNV data (which can of course be called from the same assay)
is Kasper & JP's GenomicRatioSet, which already has some weird quirks for
eSet backwards compatibility.  (e.g. sampleNames(x) works, but
sampleNames(x) <- does not work; pData(x) calls colData(x); fData(x) calls
rowData(x))  There are little niggles that I should probably just send in a
patch for, but a cleaner overall container would be better, if for no other
reason than the aforementioned ability to easily experiment with
imputation. An approach that I've been using is to stuff the SNPs, CNV (as
GRanges) and mRNA/miRNA (as a matrix) data into exptData(SE).  This is...
somewhat less than optimal, especially when subsetting.

But it does suggest that I could define a coercion from the current
rambling wreck into a nice clean new class/API (ExperimentCollection or
whatever) and I'll bet other package authors could, too.  The presence of a
GRangesFrame would then be handy for returning a given assay's results, so
that the user could be blissfully ignorant of the storage backing (ff,
BigMatrix, Matrix, matrix, Rle, whatever) but not lose the data management
advantages of a SummarizedExperiment.

JMHO







Statistics is the grammar of science.
Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

On Wed, Mar 4, 2015 at 6:40 AM, Vincent Carey st...@channing.harvard.edu
wrote:


  I am a bit concerned about any major alterations to the
SummarizedExperiment API.  We have
two papers and plenty of working code that use it in meaningful ways.
Effort required to keep new
formulations back-compatible as well as bug-free has to be weighed
seriously.

  I agree that the name is not ideal.  We are learning as we go.

  Seems to make sense to start with the contracts we want the instances of
a class to satisfy.  I have long felt
that X[i, j] idiom is one users and developers should be comfortable with,
even insist on, and for consistency
with matrix operations idiom, it should work in a natural way for numeric
indexing.  This seems like an important
constraint.  subsetBy* is a useful idiom, but it is conceivable that we
would adopt filter() for row-oriented selections
and select() for column-oriented selections.  Do we have to make any
special design considerations to allow
very smooth interoperation with out-of-memory resources for certain
components for developers who want to allow this?

  We should have a reasonable way to get data on what is out there, what
is used, how it is most effectively used.
What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
killer packages that use/don't use it?
Even getting data on the formal API for a class is not all that familiar.
And if folks are writing non-S4 interfaces (i.e., naked
functions) we have no way of identifying them.  See below for one way of
discovering the API for SummarizedExperiment.

  In summary, I think we have to be careful about overdesigning too
early.  Getting clear on contracts seems the best
way to ensure reuse, and we really want that so that reliability is
continually assessed.  My sense is that it is good
to give developers something they'll gladly extend, not necessarily reuse
directly.  So we don't have to have
broad consensus on class details, but on the minimal abstraction and on
obligatory tests on its basic implementation.


> methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM

DataFrame with 76 rows and 3 columns
         generic                                                        signature       package
     <character>                                                      <character>   <character>
1              [                   x=SummarizedExperiment, i=ANY, j=ANY, drop=ANY          base
2              [              x=SummarizedExperiment, i=ANY, j=missing, value=ANY          base
3              [                         x=SummarizedExperiment, i=ANY, j=missing          base
4            [<- x=SummarizedExperiment, i=ANY, j=ANY, value=SummarizedExperiment          base
5          assay                              x=SummarizedExperiment, i=character GenomicRanges
...          ...                                                              ...           ...
72  updateObject                                      object=SummarizedExperiment  BiocGenerics
73        values                                           x=SummarizedExperiment     S4Vectors
74      values<-                                           x=SummarizedExperiment     S4Vectors

75 

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Martin Morgan

On 03/04/2015 10:03 AM, Peter Haverty wrote:

Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus there
... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.


The current version, under R-devel, is at

  devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

  > methods(class="SummarizedExperiment")
   [1] [                 [[                [[<-              [<-
   [5] $                 $<-               assay             assay<-
   [9] assayNames        assayNames<-      assays            assays<-
  [13] cbind             coerce            colData           colData<-
  [17] compare           Compare           countOverlaps     coverage
  [21] dim               dimnames          dimnames<-        disjointBins
  [25] distance          distanceToNearest duplicated        elementMetadata
  [29] elementMetadata<- end               end<-             exptData
  [33] exptData<-        extractROWS       findOverlaps      flank
  [37] follow            granges           isDisjoint        mcols
  [41] mcols<-           narrow            nearest           order
  [45] overlapsAny       precede           ranges            ranges<-
  [49] rank              rbind             replaceROWS       resize
  [53] restrict          rowData           rowData<-         seqinfo
  [57] seqinfo<-         seqnames          shift             show
  [61] sort              split             start             start<-
  [65] strand            strand<-          subset            subsetByOverlaps
  [69] updateObject      values            values<-          width
  [73] width<-

  see ?methods for accessing help and source code

and

  > head(attr(methods(class="SummarizedExperiment"), "info"))
                                                               generic visible
  [,SummarizedExperiment,ANY-method                                  [    TRUE
  [[,SummarizedExperiment,ANY,missing-method                        [[    TRUE
  [[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE
  [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE
  $,SummarizedExperiment-method                                      $    TRUE
  $<-,SummarizedExperiment-method                                  $<-    TRUE
                                                               isS4          from
  [,SummarizedExperiment,ANY-method                            TRUE GenomicRanges
  [[,SummarizedExperiment,ANY,missing-method                   TRUE GenomicRanges
  [[<-,SummarizedExperiment,ANY,missing-method                 TRUE GenomicRanges
  [<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
  $,SummarizedExperiment-method                                TRUE GenomicRanges
  $<-,SummarizedExperiment-method                              TRUE GenomicRanges
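
For anyone who wants to try the idiom without the Bioconductor classes
installed, here is a minimal sketch on a base class. It assumes R >= 3.2
(where methods() attaches an "info" attribute) and uses data.frame purely
as a stand-in for SummarizedExperiment; the variable names are illustrative.

```r
## List the methods known for a class; "data.frame" is only a stand-in here.
m <- methods(class = "data.frame")

## The "info" attribute is a data.frame with one row per method and the
## columns generic, visible, isS4 and from.
info <- attr(m, "info")
print(head(info[, c("generic", "isS4", "from")]))

## Replacement methods are those whose generic name ends in "<-".
replacers <- grep("<-$", info$generic, value = TRUE)
print(unique(replacers))
```

Filtering on the generic name like this is one way to get a rough view of
the "setter" half of a class API.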

Martin



Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence lawrence.mich...@gene.com
wrote:


I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary, after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.

On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès hpa...@fredhutch.org wrote:


GRangesFrame is an interesting idea and I gave it some thoughts.

There is this nice symmetry between GRanges and GRangesFrame:

- GRanges = a naked GRanges + a DataFrame accessible via mcols()

- GRangesFrame = a DataFrame + a naked GRanges accessible via
  some accessor (e.g. rowRanges())

So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!

What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the dual 

[Bioc-devel] New(ish!) Seattle Bioconductor team member

2015-03-04 Thread Martin Morgan
Let me take this belated opportunity to introduce Jim Hester 
jhes...@fredhutch.org to the Bioconductor developer community.


Jim is working in the short term on SummarizedExperiment, including the 
refactoring efforts he introduced yesterday as well as coercion methods to and 
from ExpressionSet (an initial version from ExpressionSet to 
SummarizedExperiment is available in the development version GenomicRanges; 
iterations will include coercion in the reverse direction as well as perhaps 
more 'clever' mapping between the probeset or gene names of ExpressionSet and 
relevant range-based notation). Jim will also contribute to ongoing project 
activities like new package reviews, package maintenance, and upcoming release 
activities.


Jim brings a lot of interesting biological and software development experience 
to the project. Say hi when you have a chance!


Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Advice (was Re: CRAN package with Bioconductor dependencies)

2015-03-04 Thread Kevin Coombes

Hi,

I'm following this discussion with interest, for the following reason.
There are more than a dozen packages that I have written and still
maintain.  Most of them were started while I was at M.D. Anderson.
They were served from a highly non-mainstream repository hosted there,
with the code managed in a local Subversion repository behind their
firewall.  Since moving to Ohio State, I have transferred the code to
R-Forge.  (If you want to figure out what the packages are and do,
search for "OOMPA".)  So, it's still in a "non-mainstream" repository,
but it's (to continue the metaphor) at least on a bigger tributary than
it used to be.


Many of the packages are written to be compatible with some of the core 
BioConductor classes, which means that they import Biobase.


But all of the functionality is available without using BioConductor 
(provided the user is willing to assemble the data into the correct set 
of matrices).


I've been thinking about submitting them to either CRAN or BioConductor.
Which makes more sense?


Best,
  Kevin

On 3/4/2015 4:27 PM, Laurent Gatto wrote:

On  3 March 2015 06:07, Henrik Bengtsson wrote:


Not that long ago the DESCRIPTION field 'Additional_repositories' was
introduced with the purpose of providing references to non-mainstream
package repositories, e.g. R-Forge.  Interestingly, by "mainstream"
they mean CRAN and Bioconductor.  The 'Additional_repositories' field
is also enforced for CRAN packages depending on non-mainstream packages,
where "depending on" can be any package under "Depends", "Imports",
"Suggests" and (I guess) "LinkingTo" and "Enhances".

Thanks, Henrik!

If I understand correctly, Bioconductor is considered a mainstream repository
and so is not expected to be added as an Additional_repository (despite
the fact that install.packages does not install from the Bioc repository by
default). The issues with doing so nevertheless would be that the CRAN
maintainers might complain and that this would break the tied R/Bioc
versions.

Best wishes,

Laurent


I bet that in a, hopefully, not too distant future, we'll find that
install.packages() will install from not only CRAN by default, but
also Bioconductor and whatever Additional_repositories suggests.  As
usual, the bet is about food and drinks in person whenever/wherever
feasible.


BTW, I have a few feature requests related to Bioc releases/versions:

1. Add release date to online announcement pages, e.g.
http://bioconductor.org/news/bioc_2_14_release/


2. A data.frame listing Bioc versions and their release dates (maybe
even time stamps), e.g.


biocVersions()

1.0 2002-04-29
...
2.14 2014-04-14
3.0 2014-10-14
3.1 2015-04-17


3. As far as I understand it, the recommended Bioc version to use
depends on R version and the date (in the past only R version).  I
would like to have a function that returns the Bioc version as a
function of R version and date.  Maybe BiocInstaller::biocVersion()
could be extended with this feature, e.g.

biocVersion <- function(date, rversion) {
   ## Current?
   if (missing(date) && missing(rversion)) return(BIOC_VERSION)

   if (missing(date)) date <- Sys.Date()
   date <- as.Date(date)
   if (missing(rversion)) rversion <- getRversion()

   ## Lookup by (rversion, date) from known releases
   ## and make best guesses for the future (with a warning)
   ...
}
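
The lookup step elided in the function above could work roughly as
follows. This is only a sketch under stated assumptions: the table
biocReleases and the helper biocVersionAt() are hypothetical names, the
table lists just three releases (2.14 and 3.0 with their 2014 release
dates, 3.1 with the scheduled date mentioned in this thread), and the R
version is ignored entirely.

```r
## Hypothetical release table; a real one would list every release.
biocReleases <- data.frame(
    version = c("2.14", "3.0", "3.1"),
    date = as.Date(c("2014-04-14", "2014-10-14", "2015-04-17")),
    stringsAsFactors = FALSE
)

## Return the most recent release on or before 'date',
## or NA if no known release is that old.
biocVersionAt <- function(date = Sys.Date(), releases = biocReleases) {
    date <- as.Date(date)
    known <- releases[releases$date <= date, , drop = FALSE]
    if (nrow(known) == 0L) return(NA_character_)
    known$version[which.max(as.numeric(known$date))]
}

print(biocVersionAt("2014-12-01"))
```

A by-date lookup like this covers the "what was current back then" use
case; mapping an R version to a Bioc version would need a second column
in the table.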

If such a function could be available as a light-weight script online,
then the proper Bioc repos could be downloaded by
tools:::.BioC_version_associated_with_R_version(), cf. Martin's reply
on lagging Bioc versions.  This would bring us one step closer to
installing Bioc packages using install.packages(), cf. Laurent's
original post. Because it may not be clear to an R user that they need
to go to Bioconductor because a CRAN package depends on a Bioc
package. That user might not even have heard of Bioconductor. Not
suggesting biocLite() should be replaced, but the gap for using
install.packages() could be made smaller.  ... and maybe one day we'll
have an omnibus package installer/updater available in a fresh R
installation.

The above biocVersion() function would also be useful for figuring out
what R/BioC version was in use at a certain year in the past (e.g.
reproducing old work) and for finding out versions of Bioc
release/devel packages back in time (e.g. if you try to be backward
compatible).

Thxs,

Henrik

On Mon, Mar 2, 2015 at 3:41 PM, Laurent Gatto lg...@cam.ac.uk wrote:

Thank you all for your answers.

Laurent

On  2 March 2015 23:27, Martin Morgan wrote:


On 03/02/2015 03:18 PM, Laurent Gatto wrote:

Dear all,


I had never realised that CRAN packages that depend on Bioc packages
cannot actually be installed with install.packages without setting a
repo or using BiocInstaller::biocLite. Here is an example using a fresh R
installation

http://cran.r-project.org/web/packages/MSeasy/index.html
Depends: amap, clValid, cluster, fpc, mzR, xcms

$ docker run --rm -ti rocker/r-base R

R version 3.1.2 (2014-10-31) -- Pumpkin Helmet
Copyright (C) 

Re: [Bioc-devel] Advice (was Re: CRAN package with Bioconductor dependencies)

2015-03-04 Thread Vincent Carey
On Wed, Mar 4, 2015 at 5:21 PM, Kevin Coombes kevin.r.coom...@gmail.com
wrote:

 Hi,

 I'm following this discussion with interest, for the following reason.
 There are more than a dozen packages that I have written and still
 maintain.  Most of them were started while I was at M.D. Anderson.  They
 were served from a highly non-mainstream repository hosted there, with the
 code managed in a local Subversion repository behind their firewall.
 Since moving to Ohio State, I have transferred the code to R-Forge.  (If you
 want to figure out what the packages are and do, search for "OOMPA".)  So,
 it's still in a "non-mainstream" repository, but it's (to continue the
 metaphor) at least on a bigger tributary than it used to be.

 Many of the packages are written to be compatible with some of the core
 BioConductor classes, which means that they import Biobase.

 But all of the functionality is available without using BioConductor
 (provided the user is willing to assemble the data into the correct set of
 matrices).

 I've been thinking about submitting them to either CRAN or BioConductor.
 Which makes more sense?


Sometime we should write up comments to help with decisionmaking in this
domain.  In my view the main
difference at this time is the simultaneous management of release and devel
streams in Bioc.  This leads to
a bit of additional complexity for the developer but it permits aggressive
experimentation in the devel branch
that will add mileage from those using the bleeding edge, while not
affecting users of the release branch.

There may be some differences in the continuous integration interface and
the task view discovery support.
There are probably other differences that others should chime in on.



 Best,
   Kevin

 On 3/4/2015 4:27 PM, Laurent Gatto wrote:

 On  3 March 2015 06:07, Henrik Bengtsson wrote:

  Not that long ago the DESCRIPTION field 'Additional_repositories' was
 introduced with the purpose of providing references to non-mainstream
 package repositories, e.g. R-Forge.  Interestingly, by "mainstream"
 they mean CRAN and Bioconductor.  The 'Additional_repositories' field
 is also enforced for CRAN packages depending on non-mainstream packages,
 where "depending on" can be any package under "Depends", "Imports",
 "Suggests" and (I guess) "LinkingTo" and "Enhances".

 Thanks, Henrik!

 If I understand correctly, Bioconductor is considered a mainstream repository
 and so is not expected to be added as an Additional_repository (despite
 the fact that install.packages does not install from the Bioc repository by
 default). The issues with doing so nevertheless would be that the CRAN
 maintainers might complain and that this would break the tied R/Bioc
 versions.

 Best wishes,

 Laurent

  I bet that in a, hopefully, not too distant future, we'll find that
 install.packages() will install from not only CRAN by default, but
 also Bioconductor and whatever Additional_repositories suggests.  As
 usual, the bet is about food and drinks in person whenever/wherever
 feasible.


 BTW, I have a few feature requests related to Bioc releases/versions:

 1. Add release date to online announcement pages, e.g.
 http://bioconductor.org/news/bioc_2_14_release/


 2. A data.frame listing Bioc versions and their release dates (maybe
 even time stamps), e.g.

  biocVersions()

 1.0 2002-04-29
 ...
 2.14 2014-04-14
 3.0 2014-10-14
 3.1 2015-04-17


 3. As far as I understand it, the recommended Bioc version to use
 depends on R version and the date (in the past only R version).  I
 would like to have a function that returns the Bioc version as a
 function of R version and date.  Maybe BiocInstaller::biocVersion()
 could be extended with this feature, e.g.

 biocVersion <- function(date, rversion) {
    ## Current?
    if (missing(date) && missing(rversion)) return(BIOC_VERSION)

    if (missing(date)) date <- Sys.Date()
    date <- as.Date(date)
    if (missing(rversion)) rversion <- getRversion()

    ## Lookup by (rversion, date) from known releases
    ## and make best guesses for the future (with a warning)
    ...
 }

 If such a function could be available as a light-weight script online,
 then the proper Bioc repos could be downloaded by
 tools:::.BioC_version_associated_with_R_version(), cf. Martin's reply
 on lagging Bioc versions.  This would bring us one step closer to
 installing Bioc packages using install.packages(), cf. Laurent's
 original post. Because it may not be clear to an R user that they need
 to go to Bioconductor because a CRAN package depends on a Bioc
 package. That user might not even have heard of Bioconductor. Not
 suggesting biocLite() should be replaced, but the gap for using
 install.packages() could be made smaller.  ... and maybe one day we'll
 have an omnibus package installer/updater available in a fresh R
 installation.

 The above biocVersion() function would also be useful for figuring out
 what R/BioC version was in use at a certain year in the past (e.g.
 reproducing old work) and for finding out versions of Bioc
 

Re: [Bioc-devel] New(ish!) Seattle Bioconductor team member

2015-03-04 Thread Henrik Bengtsson
On Wed, Mar 4, 2015 at 2:29 PM, Michael Lawrence
lawrence.mich...@gene.com wrote:
 Welcome.

 For those who don't know, Jim is also the author of the neat lintr
 package, which checks your R code as you type, across multiple editors.

 https://github.com/jimhester/lintr

Not to mention https://github.com/jimhester/covr - It only took me one
round of 'covr' to become a test-coverage-oholic.

Jim, great to have you on board.

/Henrik


 Michael

 On Wed, Mar 4, 2015 at 2:20 PM, Martin Morgan mtmor...@fredhutch.org
 wrote:

 Let me take this belated opportunity to introduce Jim Hester 
 jhes...@fredhutch.org to the Bioconductor developer community.

 Jim is working in the short term on SummarizedExperiment, including the
 refactoring efforts he introduced yesterday as well as coercion methods to
 and from ExpressionSet (an initial version from ExpressionSet to
 SummarizedExperiment is available in the development version GenomicRanges;
 iterations will include coercion in the reverse direction as well as
 perhaps more 'clever' mapping between the probeset or gene names of
 ExpressionSet and relevant range-based notation). Jim will also contribute
 to ongoing project activities like new package reviews, package
 maintenance, and upcoming release activities.

 Jim brings a lot of interesting biological and software development
 experience to the project. Say hi when you have a chance!

 Martin
 --
 Computational Biology / Fred Hutchinson Cancer Research Center
 1100 Fairview Ave. N.
 PO Box 19024 Seattle, WA 98109

 Location: Arnold Building M1 B861
 Phone: (206) 667-2793



Re: [Bioc-devel] CRAN package with Bioconductor dependencies

2015-03-04 Thread Hervé Pagès

Hi Gordon,

On 03/04/2015 02:12 PM, Gordon Brown wrote:

Hi,


Hi, Bioc-devel folks,

Is there an accepted way to migrate a package from CRAN to BioC?
  CRAN's policy says, "The package's license must give the right for
CRAN to distribute the package in perpetuity."

Is it enough to request that CRAN archive the package, then submit it
as a new package to BioC?  I own a CRAN package (msarc) that
probably should have been in BioC from the start, but isn't.
  Suggestions?



I can't speak to the CRAN policy side of things, but several packages have 
migrated from CRAN to Bioconductor.
At a minimum, you need to make sure that the version in Bioconductor is higher 
than the version in CRAN, but you should also definitely ask CRAN to archive 
the package. The way Bioconductor works, your package will first be available 
only in the devel version, but will then become available in the release 
version after our next release. Currently we're scheduled to release 3.1 on 
April 17 and the deadline for package submissions for this release is March 27.
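
The "version in Bioconductor is higher" condition can be checked with
base R's version classes. A small sketch; both version numbers below are
made up for illustration and are not the real msarc versions.

```r
## Hypothetical version numbers, for illustration only.
cranPkgVersion <- package_version("1.0.3")
biocPkgVersion <- package_version("1.1.0")

## package_version objects compare component by component,
## so "1.10.0" is correctly treated as higher than "1.9.0".
isHigher <- biocPkgVersion > cranPkgVersion
print(isHigher)
print(package_version("1.10.0") > package_version("1.9.0"))
```

Comparing the strings directly would get cases like 1.10.0 vs 1.9.0
wrong, which is why the version class is used here.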

Dan


Thanks, Dan.   I'll check with the CRAN folks to see what they say, and pass 
along anything significant regarding policy etc.  It's not a crisis if there's 
a lag between getting it archived at CRAN and into the release version of 
Bioconductor.


The CRAN folks will ask you to proceed the other way around. First
submit your package to Bioconductor. Only after we release (in April),
ask them to remove and archive your package. It makes sense since this
ensures continuous availability of the package.

Cheers,
H.



Cheers,

  - Gord



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



Re: [Bioc-devel] CRAN package with Bioconductor dependencies

2015-03-04 Thread Gordon Brown
Hi, Bioc-devel folks,

Is there an accepted way to migrate a package from CRAN to BioC?  CRAN's policy
says, "The package's license must give the right for CRAN to distribute the
package in perpetuity."

Is it enough to request that CRAN archive the package, then submit it as a new 
package to BioC?  I own a CRAN package (msarc) that probably should have been 
in BioC from the start, but isn't.  Suggestions?

Thanks in advance,

 - Gord



Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Vincent Carey
I am a bit concerned about any major alterations to the
SummarizedExperiment API.  We have
two papers and plenty of working code that use it in meaningful ways.
Effort required to keep new
formulations back-compatible as well as bug-free has to be weighed
seriously.

I agree that the name is not ideal.  We are learning as we go.

Seems to make sense to start with the contracts we want the instances of a
class to satisfy.  I have long felt
that X[i, j] idiom is one users and developers should be comfortable with,
even insist on, and for consistency
with matrix operations idiom, it should work in a natural way for numeric
indexing.  This seems like an important
constraint.  subsetBy* is a useful idiom, but it is conceivable that we
would adopt filter() for row-oriented selections
and select() for column-oriented selections.  Do we have to make any
special design considerations to allow
very smooth interoperation with out-of-memory resources for certain
components for developers who want to allow this?
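
To make the X[i, j] contract concrete, here is a minimal sketch with a
toy S4 class. ToyExperiment and its slots are invented for illustration
and are not the SummarizedExperiment implementation; the point is only
that numeric i and j subset the assay, the row annotation and the column
annotation together, the way matrix indexing would.

```r
library(methods)

## Toy container: one assay matrix plus row and column annotation.
setClass("ToyExperiment",
    representation(assay = "matrix",
                   rowInfo = "data.frame",
                   colInfo = "data.frame"))

setMethod("dim", "ToyExperiment", function(x) dim(x@assay))

## x[i, j]: rows of the assay and rowInfo follow i,
## columns of the assay and rows of colInfo follow j.
setMethod("[", "ToyExperiment", function(x, i, j, ..., drop = TRUE) {
    if (missing(i)) i <- seq_len(nrow(x@assay))
    if (missing(j)) j <- seq_len(ncol(x@assay))
    new("ToyExperiment",
        assay = x@assay[i, j, drop = FALSE],
        rowInfo = x@rowInfo[i, , drop = FALSE],
        colInfo = x@colInfo[j, , drop = FALSE])
})

se <- new("ToyExperiment",
    assay = matrix(1:12, nrow = 4,
                   dimnames = list(paste0("g", 1:4), paste0("s", 1:3))),
    rowInfo = data.frame(id = paste0("g", 1:4), stringsAsFactors = FALSE),
    colInfo = data.frame(id = paste0("s", 1:3), stringsAsFactors = FALSE))

sub <- se[c(1, 3), 2:3]

## The kind of "obligatory test" a contract could demand:
stopifnot(identical(dim(sub), c(2L, 2L)),
          identical(sub@assay, se@assay[c(1, 3), 2:3, drop = FALSE]),
          identical(sub@rowInfo$id, c("g1", "g3")),
          identical(sub@colInfo$id, c("s2", "s3")))
```

Checks of this sort could be run against any class that claims to honour
the contract, whatever its storage backing.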

We should have a reasonable way to get data on what is out there, what is
used, how it is most effectively used.
What's the SE API?  Is it well-adapted to requirements of DESeq2?  Other
killer packages that use/don't use it?
Even getting data on the formal API for a class is not all that familiar.
And if folks are writing non-S4 interfaces (i.e., naked
functions) we have no way of identifying them.  See below for one way of
discovering the API for SummarizedExperiment.

In summary, I think we have to be careful about overdesigning too early.
Getting clear on contracts seems the best
way to ensure reuse, and we really want that so that reliability is
continually assessed.  My sense is that it is good
to give developers something they'll gladly extend, not necessarily reuse
directly.  So we don't have to have
broad consensus on class details, but on the minimal abstraction and on
obligatory tests on its basic implementation.

> methods(class="SummarizedExperiment")  # perhaps an obsolete version of methods cataloguer by MTM

DataFrame with 76 rows and 3 columns
         generic                                                        signature       package
     <character>                                                      <character>   <character>
1              [                   x=SummarizedExperiment, i=ANY, j=ANY, drop=ANY          base
2              [              x=SummarizedExperiment, i=ANY, j=missing, value=ANY          base
3              [                         x=SummarizedExperiment, i=ANY, j=missing          base
4            [<- x=SummarizedExperiment, i=ANY, j=ANY, value=SummarizedExperiment          base
5          assay                              x=SummarizedExperiment, i=character GenomicRanges
...          ...                                                              ...           ...
72  updateObject                                      object=SummarizedExperiment  BiocGenerics
73        values                                           x=SummarizedExperiment     S4Vectors
74      values<-                                           x=SummarizedExperiment     S4Vectors
75         width                                           x=SummarizedExperiment  BiocGenerics
76       width<-                                           x=SummarizedExperiment  BiocGenerics

On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo hcorr...@gmail.com
wrote:

 May I advocate for  'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
 return whatever makes sense (GRanges, or other data structures -thinking
 taxonomy for metagenomics for example-). GRangesFrame can inherit from
 this.

 On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès hpa...@fredhutch.org wrote:

  GRangesFrame is an interesting idea and I gave it some thoughts.
 
  There is this nice symmetry between GRanges and GRangesFrame:
 
  - GRanges = a naked GRanges + a DataFrame accessible via mcols()
 
  - GRangesFrame = a DataFrame + a naked GRanges accessible via
   some accessor (e.g. rowRanges())
 
  So GRanges and GRangesFrame are equivalent in terms of what they
  can hold, but different in terms of API: the former has the ranges
  API as primary API and the DataFrame API on its mcols() component,
  and the latter has the DataFrame API as primary API and the ranges
  API on its rowRanges() component. Nice switch!
 
  What does this API switch bring us? A GRangesFrame object is now
  an object that fully behaves like a DataFrame and people can also
  perform range-based operations on its rowRanges() component.
  Here is what I'm afraid is going to happen: people will also want
  to be able to perform range-based operations *directly* on
  these objects, i.e. without having to call rowRanges() first.
  So for example when they do subsetByOverlaps(), subsetting
  happens vertically. Also the Hits object returned by findOverlaps()
  would contain row indices. Problem with this is that these objects
  now start to suffer from the dual personality syndrome. For
  example, it's not clear anymore what their length should be.
  Strictly speaking it should be their number of columns (that's
  what the length of a DataFrame is), but the ranges API that
  we're trying to put on them also makes them feel like vectors
  along the vertical dimension so it also feels that their length
  

Re: [Bioc-devel] Changes to the SummarizedExperiment Class

2015-03-04 Thread Hector Corrada Bravo
May I advocate for  'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can
return whatever makes sense (GRanges, or other data structures -thinking
taxonomy for metagenomics for example-). GRangesFrame can inherit from this.

On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès hpa...@fredhutch.org wrote:

 GRangesFrame is an interesting idea and I gave it some thoughts.

 There is this nice symmetry between GRanges and GRangesFrame:

 - GRanges = a naked GRanges + a DataFrame accessible via mcols()

 - GRangesFrame = a DataFrame + a naked GRanges accessible via
  some accessor (e.g. rowRanges())

 So GRanges and GRangesFrame are equivalent in terms of what they
 can hold, but different in terms of API: the former has the ranges
 API as primary API and the DataFrame API on its mcols() component,
 and the latter has the DataFrame API as primary API and the ranges
 API on its rowRanges() component. Nice switch!

 What does this API switch bring us? A GRangesFrame object is now
 an object that fully behaves like a DataFrame and people can also
 perform range-based operations on its rowRanges() component.
 Here is what I'm afraid is going to happen: people will also want
 to be able to perform range-based operations *directly* on
 these objects, i.e. without having to call rowRanges() first.
 So for example when they do subsetByOverlaps(), subsetting
 happens vertically. Also the Hits object returned by findOverlaps()
 would contain row indices. Problem with this is that these objects
 now start to suffer from the dual personality syndrome. For
 example, it's not clear anymore what their length should be.
 Strictly speaking it should be their number of columns (that's
 what the length of a DataFrame is), but the ranges API that
 we're trying to put on them also makes them feel like vectors
 along the vertical dimension so it also feels that their length
 should be their number of rows. Same thing with 1D subsetting.
 Why does it subset the columns and not the rows? Most people
 are now confused.
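
The length and 1D-subsetting ambiguity described above can be seen with
plain base R objects. In this sketch a base data.frame stands in for
DataFrame and a numeric vector stands in for the ranges side; neither is
the actual Bioconductor class, and the column names are made up.

```r
## Column-oriented view: a data.frame is a list of columns.
df <- data.frame(score = c(0.1, 0.5, 0.9, 0.3),
                 strand = c("+", "-", "+", "-"),
                 stringsAsFactors = FALSE)
print(length(df))   # number of columns
print(nrow(df))     # number of rows
print(df[1])        # 1D subsetting selects the first column

## Vector-oriented view: one element per range.
starts <- c(10, 20, 30, 40)
print(length(starts))   # number of elements
print(starts[1])        # 1D subsetting selects the first element
```

An object that tries to be both has to pick one answer for length(x) and
x[1], and whichever it picks breaks the other API's expectations.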

 It's interesting to note that the same thing happens with GRanges
 objects, but in the opposite direction: people wish they could
 do DataFrame operations directly on them without calling mcols()
 first. But in order to preserve the good health of GRanges objects,
 we've not done that (except for $, a shortcut for mcols(x)$,
 the pressure was just too strong).

 H.



 On 03/03/2015 04:35 PM, Michael Lawrence wrote:

 Should be possible for the annotations to be of any type, as long as they
 satisfy a simple contract of NROW() and 2D [. Then, you could have a
 DataFrame, GRanges, or whatever in there. But it would be nice to have a
 special class for the container with range information. The contract for
 the range annotation would be to have a granges() method.

 I agree it would be nice if there was a way with the methods package to
 easily assert such contracts. For example, one could define an interface
 with a set of generics (and optionally the relevant position in the
 generic signature). Then, once all of the methods have been assigned for a
 particular class, it is made to inherit from that contract class. There
 are lots of gotchas though. Not sure how useful it would be in practice.
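
One rough way to approximate such a contract check with the methods
package as it stands is to ask whether a class has a method for every
generic in a list. The generics toyRanges and toyNRow, the two classes
and the helper satisfiesContract() below are all hypothetical names used
only for this sketch; it tests for the presence of methods, not for
their behaviour.

```r
library(methods)

## Hypothetical generics standing in for a range-annotation contract.
setGeneric("toyRanges", function(x, ...) standardGeneric("toyRanges"))
setGeneric("toyNRow", function(x, ...) standardGeneric("toyNRow"))

## A class that implements the whole contract.
setClass("ToyAnnotation", representation(start = "integer", end = "integer"))
setMethod("toyRanges", "ToyAnnotation",
    function(x, ...) cbind(start = x@start, end = x@end))
setMethod("toyNRow", "ToyAnnotation", function(x, ...) length(x@start))

## A class that implements only part of it.
setClass("ToyIncomplete", representation(values = "numeric"))
setMethod("toyNRow", "ToyIncomplete", function(x, ...) length(x@values))

## TRUE if 'className' has a method for every generic named in 'generics'.
satisfiesContract <- function(className, generics) {
    all(vapply(generics, function(g) hasMethod(g, className), logical(1)))
}

contract <- c("toyRanges", "toyNRow")
print(satisfiesContract("ToyAnnotation", contract))
print(satisfiesContract("ToyIncomplete", contract))
```

This does not cover the gotchas mentioned above (argument positions in
the signature, methods defined later, inheritance from the contract
class), so it is a starting point at best.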


 On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty haverty.pe...@gene.com
 wrote:

  There are some nice similarities in these new imaginary types.  A
 GRangesFrame is a list of dimensionally identical things (columns) and
 some row meta-data (the GRanges).  The SE-like object is similarly a list
 of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
 HDF5-backed things) with some row meta-data (a DataFrame or
 GRangesFrame).
 Elegant?  Maybe they would actually be relatives in the class tree.

 I wonder if this kind of thing would be easier if we had Java-style
 Interfaces or duck-typing.  The x slot of y holds something that
 implements this set of methods ...

 Oh, and kinda apropos, the genoset class will probably go away or become
 an extension to this new SE-like thing.  The extra stuff that comes along
 with genoset will still be available.

 Pete

 
 Peter M. Haverty, Ph.D.
 Genentech, Inc.
 phave...@gene.com

 On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. tim.tri...@gmail.com
 wrote:

  This.

 It would be damned near perfect as a return value for assays coming out
 of an object that held several such assays at several time points in a
 population, where there are both assay-wise and covariate-wise holes
 that could nonetheless be usefully imputed across assays.


 Statistics is the grammar of science.
 Karl Pearson http://en.wikipedia.org/wiki/The_Grammar_of_Science

 On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty haverty.pe...@gene.com
 wrote:




   I still think GRanges should be a subclass of DataFrame,

 which would make this easy, but I don't seem to be winning that

 argument.



 Just impossible. As Michael mentioned back in November, they have
 conflicting