With GenomicRanges 1.19.48, I'm still seeing the problem from my March 9 email: renaming the first assay duplicates the assay matrix in memory. I also tried assayNames<-. My use case is being handed a SummarizedExperiment whose first assay is not named "counts" (although the SE most likely comes from summarizeOverlaps(), in which case the assay is already named "counts").
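
To make the use case concrete, here is a rough sketch of the check I have in mind. It is only an illustration: ensureCountsName() is a made-up helper name, the matrix is a small stand-in for real counts, and it assumes SummarizedExperiment(), assays() and assayNames<- are available from GenomicRanges (if the class has been spun out into its own package, as discussed further down the thread, load that package instead).

```r
# Sketch under stated assumptions, not a recommended implementation.
# Assumes GenomicRanges provides SummarizedExperiment(), assays(), assayNames<-.
library(GenomicRanges)

# Hypothetical helper: make sure the first assay is called "counts".
ensureCountsName <- function(se) {
  # withDimnames=FALSE skips attaching dimnames when the assays are extracted
  a <- assays(se, withDimnames=FALSE)
  nms <- names(a)
  if (is.null(nms)) nms <- rep("", length(a))   # unnamed assays
  if (!identical(nms[1], "counts")) {
    nms[1] <- "counts"
    assayNames(se) <- nms   # the renaming step under discussion
  }
  se
}

m <- matrix(1:1e6, ncol=10)                 # unnamed stand-in for a count matrix
se <- SummarizedExperiment(SimpleList(m))   # single, unnamed assay
invisible(gc(reset=TRUE))                   # reset the "max used" counters
se <- ensureCountsName(se)
gc()   # compare "max used" with "used" for Vcells to see whether memory grew
```

If the renaming copies the matrix, the "max used" column for Vcells should rise by roughly the size of the matrix between the two gc() calls.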

> sessionInfo()
R Under development (unstable) (2015-03-31 r68129)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils
[8] methods   base

other attached packages:
[1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16   IRanges_2.1.43
[4] S4Vectors_0.5.22      BiocGenerics_0.13.10  testthat_0.9.1
[7] devtools_1.7.0        knitr_1.9             BiocInstaller_1.17.6

loaded via a namespace (and not attached):
[1] formatR_1.1    XVector_0.7.4  tools_3.3.0    stringr_0.6.2  evaluate_0.5.5

On Mon, Mar 9, 2015 at 1:21 PM, Michael Love <michaelisaiahl...@gmail.com> wrote:
>
>
> On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmor...@fredhutch.org> wrote:
> >
> > On 03/09/2015 08:07 AM, Michael Love wrote:
> >>
> >> Some guidance on how to avoid duplication of the matrix for developers
> >> would be greatly appreciated.
> >
> >
> > It's unsatisfactory, but using withDimnames=FALSE avoids duplication on
> > extraction of assays (but obviously you don't have dimnames on the matrix).
> > Row or column subsetting necessarily causes the subsetted assay data to be
> > duplicated. There should not be any duplication when rowRanges() or
> > colData() are changed without changing their dimension / ordering.
> >
>
> Thanks Martin for checking into the regression.
>
> Sorry, I should have been more specific earlier, I meant more
> guidance/documentation in the man page for SE. I scanned the 'Extension'
> section but didn't find a note on withDimnames for extracting the matrix or
> this example of renaming the assays (it seems like this could easily be
> relevant for other package authors).
>
> A prominent note there might help devs write more memory efficient packages.
>
> The argument section mentions speed but I'd explicitly mention memory given
> that we're often storing big matrices:
>
> "Setting withDimnames=FALSE increases the speed with which assays are
> extracted."
>
> (it's entirely possible the info is there but I missed it)
>
> Best,
>
> Mike
>
> >
> >> Another example of a trouble point, is that if I am given an SE with
> >> an unnamed assay and I need to give the assay a name, this also can
> >> expand the memory used. I had found a solution (which works with
> >> GenomicRanges 1.18 / current release) with:
> >>
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >>
> >> But now I'm looking in devel and this appears to no longer work. The
> >> memory used expands, equivalent to:
> >>
> >> names(assays(se))[1] <- "foo"
> >>
> >> Here's some code to try this:
> >>
> >> m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
> >> se <- SummarizedExperiment(m)
> >> names(assays(se, withDimnames=FALSE))[1] <- "foo"
> >> names(assays(se))[1] <- "foo"
> >>
> >> while running gc() in between steps.
> >
> >
> > I think this is a regression of some sort, and I'll look into it. Thanks
> > for the heads-up.
> >
> > Martin
> >
> >
> >>
> >>
> >> On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen
> >> <kasperdanielhan...@gmail.com> wrote:
> >>>
> >>> On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey
> >>> <st...@channing.harvard.edu> wrote:
> >>>
> >>>> I am glad you are keeping this discussion alive Kasper.
> >>>>
> >>>> On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <
> >>>> kasperdanielhan...@gmail.com> wrote:
> >>>>
> >>>>> It sounds like the proposed changes are already made. However (like
> >>>>> others) I am still a bit mystified why this was necessary. The old
> >>>>> version did allow for a GRanges inside the DataFrame of the rowData,
> >>>>> as far as I recall. So I assume this is for efficiency. But why? What
> >>>>> kind of data/use cases is this for?
> >>>>>
> >>>>> I am happy to hear that SummarizedExperiment is going to be spun out
> >>>>> into its own package. When that happens, I have some comments, which
> >>>>> I'll include here in anticipation
> >>>>> 1) I now very strongly believe it was a design mistake to not have
> >>>>> colnames on the assays. The advantage of this choice is that
> >>>>> sampleNames are only stored one place. The extreme disadvantage is
> >>>>> the high inefficiency when you want colnames on an extracted assay.
> >>>>>
> >>>>
> >>>> after example(SummarizedExperiment)
> >>>>
> >>>>> colnames(assays(se1)[[1]])
> >>>>
> >>>> [1] "A" "B" "C" "D" "E" "F"
> >>>>
> >>>> so this seems to be optional. But attempts to set rownames will fail
> >>>> silently
> >>>>
> >>>>> rownames(assays(se1)[[1]]) = as.character(1:200)
> >>>>
> >>>>
> >>>>> rownames(assays(se1)[[1]])
> >>>>
> >>>>
> >>>> NULL
> >>>> seems we could issue a warning there
> >>>>
> >>>
> >>>
> >>> Vince, you need to be careful here.
> >>>
> >>> The assays are stored without colnames (unless something has recently
> >>> changed). The default is to - upon extraction - set the colnames of the
> >>> matrix. This however requires a copy of the entire matrix. So
> >>> essentially, upon extraction, each assay is needlessly duplicated to add
> >>> the colnames. This is what I mean by inefficient. I would prefer to store
> >>> the assays with colnames. This means that changing sampleNames of the
> >>> object will be inefficient (as it is for eSets) since it would require a
> >>> complete copy of everything. But I would rather - much rather - copy when
> >>> setting sampleNames than copy when extracting an assay.
> >>>
> >>> Best,
> >>> Kasper
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> Bioc-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> >>
> >
> >
> > --
> > Computational Biology / Fred Hutchinson Cancer Research Center
> > 1100 Fairview Ave. N.
> > PO Box 19024 Seattle, WA 98109
> >
> > Location: Arnold Building M1 B861
> > Phone: (206) 667-2793