Re: [Bioc-devel] Changes to the SummarizedExperiment Class

Michael Love Tue, 31 Mar 2015 12:42:34 -0700

With GenomicRanges 1.19.48, I'm still having issues with renaming the
first assay and the duplication of memory described in my March 9 email.
I tried assayNames<- as well. My use case is being given a
SummarizedExperiment whose first assay is not named "counts"
(although the SE most likely comes from summarizeOverlaps() and is
already named "counts"...).
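
For concreteness, the two renaming attempts mentioned above can be sketched
roughly as follows. This is only an illustration under stated assumptions:
it reuses the test matrix from the March 9 email quoted below at a smaller
size, the package-loading lines are a guess at where SummarizedExperiment()
lives in a given installation (GenomicRanges in the 1.19.x series, a separate
SummarizedExperiment package if one is installed), and the gc() calls are
just a rough way to watch memory between steps.

```r
# Assumption: SummarizedExperiment() comes from GenomicRanges (1.19.x, as in
# the sessionInfo() output in this message) or from a separate
# SummarizedExperiment package when one is installed.
pkg <- if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  "SummarizedExperiment"
} else {
  "GenomicRanges"
}
library(pkg, character.only = TRUE)

# Same shape as the March 9 test case, scaled down (1e5 x 10, not 1e6 x 10).
m <- matrix(1:1e6, ncol = 10, dimnames = list(1:1e5, 1:10))
se <- SummarizedExperiment(m)
invisible(gc())

# Attempt 1: rename through assays() without attaching dimnames.
se1 <- se
names(assays(se1, withDimnames = FALSE))[1] <- "counts"
print(gc())

# Attempt 2: rename through the assayNames<- replacement method.
se2 <- se
assayNames(se2)[1] <- "counts"
print(gc())

# Both forms should leave the first assay named "counts".
stopifnot(identical(names(assays(se1, withDimnames = FALSE)), "counts"))
stopifnot(identical(assayNames(se2), "counts"))
```

Comparing the "used" columns printed by gc() after each step gives a rough
idea of whether the assay matrix was copied; the sketch does not assume
either outcome.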


> sessionInfo()
R Under development (unstable) (2015-03-31 r68129)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43        S4Vectors_0.5.22
[5] BiocGenerics_0.13.10  testthat_0.9.1        devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.6

loaded via a namespace (and not attached):
[1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5

On Mon, Mar 9, 2015 at 1:21 PM, Michael Love
<michaelisaiahl...@gmail.com> wrote:
>
>
> On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmor...@fredhutch.org> wrote:
> >
> > On 03/09/2015 08:07 AM, Michael Love wrote:
> >>
> >> Some guidance on how to avoid duplication of the matrix for developers
> >> would be greatly appreciated.
> >
> >
> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on 
> > extraction of assays (but obviously you don't have dimnames on the matrix). 
> > Row or column subsetting necessarily causes the subsetted assay data to be 
> > duplicated. There should not be any duplication when rowRanges() or 
> > colData() are changed without changing their dimension / ordering.
> >
>
> Thanks Martin for checking into the regression.
>
> Sorry, I should have been more specific earlier. I meant more
> guidance/documentation in the man page for SE. I scanned the 'Extension'
> section but didn't find a note on withDimnames for extracting the matrix, or
> this example of renaming the assays (which could easily be relevant for
> other package authors).
>
> A prominent note there might help devs write more memory efficient packages.
>
> The argument section mentions speed but I'd explicitly mention memory given 
> that we're often storing big matrices:
>
> "Setting withDimnames=FALSE  increases the speed with which assays are 
> extracted."
>
> (It's entirely possible the info is there but I missed it.)
>
> Best,
>
> Mike
>
> >
> >> Another trouble point: if I am given an SE with an unnamed assay and
> >> need to give the assay a name, this can also expand the memory used.
> >> I had found a solution (which works with GenomicRanges 1.18 / current
> >> release) with:
> >>
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >>
> >> But now I'm looking in devel and this appears to no longer work. The
> >> memory used expands, equivalent to:
> >>
> >> names(assays(se))[1] <- "foo"
> >>
> >> Here's some code to try this:
> >>
> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
> >> se <- SummarizedExperiment(m)
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >> names(assays(se))[1] <- "foo"
> >>
> >> while running gc() in between steps.
> >
> >
> > I think this is a regression of some sort, and I'll look into it. Thanks 
> > for the heads-up.
> >
> > Martin
> >
> >
> >>
> >>
> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
> >> <kasperdanielhan...@gmail.com> wrote:
> >>>
> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <st...@channing.harvard.edu>
> >>> wrote:
> >>>
> >>>> I am glad you are keeping this discussion alive Kasper.
> >>>>
> >>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
> >>>> kasperdanielhan...@gmail.com> wrote:
> >>>>
> >>>>> It sounds like the proposed changes are already made.  However (like
> >>>>> others) I am still a bit mystified why this was necessary.  The old
> >>>>> version did allow for a GRanges inside the DataFrame of the rowData,
> >>>>> as far as I recall.  So I assume this is for efficiency.  But why?
> >>>>> What kind of data/use cases is this for?
> >>>>>
> >>>>> I am happy to hear that SummarizedExperiment is going to be spun out
> >>>>> into its own package.  When that happens, I have some comments, which
> >>>>> I'll include here in anticipation:
> >>>>>    1) I now very strongly believe it was a design mistake to not have
> >>>>> colnames on the assays.  The advantage of this choice is that
> >>>>> sampleNames are only stored in one place.  The extreme disadvantage is
> >>>>> the high inefficiency when you want colnames on an extracted assay.
> >>>>>
> >>>>
> >>>> after example(SummarizedExperiment)
> >>>>
> >>>>> colnames(assays(se1)[[1]])
> >>>>
> >>>> [1] "A" "B" "C" "D" "E" "F"
> >>>>
> >>>> so this seems to be optional.  But attempts to set rownames will fail
> >>>> silently
> >>>>
> >>>>> rownames(assays(se1)[[1]]) = as.character(1:200)
> >>>>
> >>>>
> >>>>> rownames(assays(se1)[[1]])
> >>>>
> >>>>
> >>>> NULL
> >>>> It seems we could issue a warning there.
> >>>>
> >>>
> >>>
> >>> Vince, you need to be careful here.
> >>>
> >>> The assays are stored without colnames (unless something has recently
> >>> changed).  The default is to - upon extraction - set the colnames of the
> >>> matrix.  This however requires a copy of the entire matrix.  So
> >>> essentially, upon extraction, each assay is needlessly duplicated to add
> >>> the colnames.  This is what I mean by inefficient. I would prefer to store
> >>> the assays with colnames.  This means that changing sampleNames of the
> >>> object will be inefficient (as it is for eSets) since it would require a
> >>> complete copy of everything.  But I would rather - much rather - copy when
> >>> setting sampleNames than copy when extracting an assay.
> >>>
> >>> Best,
> >>> Kasper
> >>>
> >>> _______________________________________________
> >>> Bioc-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >
> >
> > --
> > Computational Biology / Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N.
> > PO Box 19024 Seattle, WA 98109
> >
> > Location: Arnold Building M1 B861
> > Phone: (206) 667-2793

