Re: [Bioc-devel] Help in designing class based on SummarizedExperiment

Martin Morgan Fri, 14 Feb 2014 17:57:01 -0800

Hi Peter --

I'm not really sure that I can provide good advice here; it feels a bit like a'round peg square hole' fit with SummarizedExperiment, since your 'pos' is notcharacterized by genomic ranges, and columns of 'assay' does not correspond toseparate samples.


One possibility is to create a Tuple class

    .Tuple <- setClass("Tuple",
        representation(m="integer"),
        contains="SimpleList", )

    Tuple <- function(seqnames, pos, count)
        ## note constructor: base class(es) as first and unnamed arg
        .Tuple(SimpleList(seqnames=seqnames, pos=pos, count=count), m=ncol(pos))

    setMethod(show, "Tuple", function(object) {
        cat(length(object$seqnames), " x ", object@m, "-", sep="")
        callNextMethod()
    })

    tuple <- Tuple(test_data$seqnames,
                   matrix(unlist(test_data[2:4]), ncol=3),
                   matrix(unlist(test_data[5:12]), ncol=8))

You could implement a validity method to enforce correct list membership andelement dimensions, and a subset method to allow for row-wise subsetting acrosselements. You could store the seqnames as an Rle, and the matrix as a sparsematrix (Matrix::Matrix(..., sparse=TRUE)) or as a DataFrame with columns asRle's (but many of your operations will presumably be matrix-like, so you'll endup coercing to a (dense) matrix anyway). I guess your overall data is aTupleList, something like


    .TupleList <- setClass("TupleList", contains="CompressedList",
                           prototype=prototype(elementType="Tuple"))

The pass-by-reference approach used in SummarizedExperiment is not exported fromGenomicRanges, so it's not easy to get the pass-by-reference / copy-on-changesemantics; worth noting that the copying cost is incurred when the data is_modified_, not just accessed, and it's often the case that the data is notupdated after creation. A not-obvious paradigm when updating several slots is tocall the initialize method with the original object, maybe like


    setMethod("[", c("Tuple", "ANY", "missing"),
        function(x, i, j, ..., drop=FALSE)
    {
        if (!missing(drop))
            warning("'drop' ignored when subsetting ", sQuote(class(x)))
        initialize(x, SimpleList(seqnames=x$seqnames[i],
            pos=x$pos[i,,drop=FALSE], count=x$count[i,,drop=FALSE]))
    })

This is more memory-efficient than updating slots one at a time, and presumesthat you have avoided writing an initialize method that breaks this behavior.

The assays slot of SummarizedExperiment can be anything that has matrix-likesemantics, so for instance you could save a sparse matrix


    SummarizedExperiment(SimpleList(counts=Matrix(m, sparse=TRUE)))

There is not a matrix representation based on Rle's, though (maybe there's asparse representation in the Matrix package that does a good job of representingyour data anyway?).

Hope that helps, and is not too misleading. Perhaps others on the list will haveadditional ideas.


Martin

On 02/09/2014 06:20 PM, Peter Hickey wrote:

Hi all,

Apologies up front for the rather long post.

I'm designing a class to store what I call co-methylation m-tuples. These are 
based on a very simple tab-delimited file format.
For example, here are 1-tuples (m = 1):
chr             pos1    M       U
chr1    57691   0       1
chr1    59276   1       0
chr1    60408   1       0
chr1    63495   1       0
chr1    63568   2       0
chr1    63627   3       0

2-tuples (m = 2):
chr             pos1    pos2    MM      MU      UM      UU
chr1    567438  567570  0       0       0       2
chr1    567501  567549  0       0       0       35
chr1    567549  567558  0       1       0       139

3-tuples (m = 3):
chr             pos1    pos2    pos3    MMM     MMU     MUM     MUU     UMM     
UMU     UUM     UUU
chr1    13644   13823   13828   1               0               0               
0               0               0               0               0
chr1    14741   14747   14773   1               0               0               
0               0               0               0               0

etc.

1-tuples are basically the standard input to an analysis of BS-seq data.

I think of these files as being comprised of 3 parts: the 'chr' column (chr), 
the 'pos' matrix (pos1, pos2, pos3) and the 'counts' matrix (MMM, MMU, MUM, 
MUU, UMM, UMU, UUM, UUU), when m = 3. For a given value of 'm' there is one 
'chr' column, m 'pos' columns and 2^m 'counts' columns.

I want to implement a class for these objects as I'm writing a package for the 
analysis of this type of data. I'd like a GRanges-type object storing the 
genomic information and a matrix-like object storing the counts. After 
tinkering around for a while, and doing some reading of the code in packages 
such as GenomicRanges and bsseq, I decided to extend the SummarizedExperiment 
class.  I now have a prototype but I have some questions and would appreciate 
feedback on some of my design choices before I translate my existing functions 
to work with this class of object.

Here is the code for the prototype:
#####################################################################
library(GenomicRanges)

setClass("CoMeth", contains = "SummarizedExperiment")

CoMeth <- function(seqnames, pos, counts, m, methylation_type, sample_name, strand = 
"*", seqlengths = NULL, seqinfo = NULL){

   # Argument checks, etc. go here #

   gr <- GRanges(seqnames = seqnames, ranges = IRanges(start = pos[[1]], end = 
pos[[length(pos)]]), strand = strand, seqlengths = seqlengths, seqinfo = seqinfo) 
# The width of each element is defined by the first and last 'pos', e.g. for 
3-tuples it is defined by pos1 and pos3.
   # Need to store the "extra" positions if m > 2. Each additional position is 
stored as a separate assay
   if (m > 2){
     extra_pos <- lapply(seq(2, m - 1, 1), function(i, pos){
       pos[[i]]
     }, pos = pos)
     names(extra_pos) <- names(pos)[2:(m-1)]
   } else {
     extra_pos <- NULL
   }
   assays <- SimpleList(c(counts, extra_pos))
   colData <- DataFrame(sample_name = sample_name, m = m, methylation_type = 
paste0(sort(methylation_type), collapse = '/'))
   cometh <- SummarizedExperiment(assays = assays, rowData = gr, colData = 
colData)
   cometh <- as(cometh, "CoMeth")

   return(cometh)
}

And here's some example data:
# A function that roughly imitates the output of a call to scan() to read in 
BS-seq m-tuple data
# m is the size of the m-tuples
# n is the number of m-tuples
# z is the proportion of each column of 'counts' that is zero
make_test_data <- function(m, n, z){
   seqnames <- list(seqnames = rep('chr1', n))
   pos <- lapply(1:m, function(x, n){matrix(seq(from = 1 + x - 1, to = n + x - 
1, by = 1), ncol = 1)}, n = n) # Need these to be matrices rather than vectors
   names(pos) <- paste0('pos', 1:m)
   # A rough hack to simulate counts where a proportion (z) are 0 and the rest 
are sampled from Poisson(lambda). Small values of lambda will inflate the 
zero-count.
   counts <- mapply(FUN = function(i, z, n, lambda){
     nz <- floor(n * (1 - z))
     matrix(sample(c(rpois(nz, lambda), rep(0, n - nz))), ncol = 1)
     }, i = 1:(2 ^ m), z = z, n = n, lambda = 4, SIMPLIFY = FALSE) # Need these 
to be matrices rather than vectors
   names(counts) <- sort(do.call(paste0, expand.grid(lapply(seq_len(m), 
function(x){c('M', 'U')}))))
   return(c(seqnames, pos, counts))
}

m <- 3 # An example using 3-tuples
n <- 1000 # A typical value for 3-tuples from a methylC-seq experiment is n = 
17,000,000
z <- c(0.2, 0.6, 0.6, 0.7, 0.6, 0.8, 0.8, 0.7) # Typical proportions of each 
column of 'counts' that are zero when using 3-tuples for a methylC-seq experiment
test_data <- make_test_data(n = n, m = m, z = z)
cometh <- CoMeth(seqnames = test_data[['seqnames']], pos = 
test_data[grepl('pos', names(test_data))], counts = test_data[grepl('[MU]', 
names(test_data))], m = m, methylation_type = 'CG', sample_name = 'test_data')

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicRanges_1.14.4 XVector_0.2.0        IRanges_1.20.6       
BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] stats4_3.0.2 tools_3.0.2
#####################################################################

Questions
        1. How can I move the 'extra_pos' columns from the assay slot but keep 
the copy-on-change behaviour? From a design perspective, I think it would make 
more sense for the 'extra_pos' columns, i.e. ('pos2') for 3-tuples and ('pos2', 
'pos3') for 4-tuples etc., to be in their own slot rather than in the assays 
slot, after all, they aren't assays but rather are additional genomic 
co-ordinates. The 'extra_pos' fields are fixed (at least until I start 
subsetting or combining multiple CoMeth objects). My understanding of the the 
SummarizedExperiment class is that the assays slot is a reference class to 
avoid excessive copying when changing other slots of a SummarizedExperiment 
object. So if the 'extra_pos' columns were stored outside of the assays slot 
then these would have to be copied when any changes are made to the other slots 
of a CoMeth object, correct? Is there a way to avoid this, i.e. so that these 
'extra_pos' columns are stored separately from the assays slot but with the

  copy-on-change behaviour of the assays slot?
        2. Is the correct to compute something based on the 'counts' data via 
the assay() accessor? For example, I might want a helper function 
getCounts(cometh) that does the equivalent of sapply(X = 1:(2^m), function(i, 
cometh){assay(cometh, i)}, cometh = cometh). Similarly, I might want to compute 
the coverage of an m-tuple, which would be the equivalent of 
rowSums(getCounts(cometh)). Is this the correct way to do this sort of thing?
        3. How do I measure the size of a SummarizedExperiment/CoMeth object? For example, with the 
test data, print(object.size(cometh), units = "auto") < 
print(object.size(assays(cometh)), units = "auto"), so it seems that the size of the assays 
slot isn't counted by object.size().
        4. Is it possible to store an Rle-type object in the assays slot of a 
SummarizedExperiment? 20-80% of the entries in each column of 'counts' are zero 
and there are often runs of zeros. So I thought that perhaps an Rle 
representation (column-wise) might be more (memory) efficient. But I can't seem 
to get an Rle object in the assays slot (I tried via DataFrame); is it even 
possible?
        5. Are there matrix-like objects with Rle columns? I found this thread started by 
Kasper Hansen (https://stat.ethz.ch/pipermail/bioconductor/2012-June/046473.html) 
discussing the idea of matrix-like object where the columns are Rle's; I could imagine 
using such an object for a CoMeth object containing multiple samples, i.e. MMM is a 
matrix-like object with ncol = # of samples, MMU is matrix-like object with ncol = # of 
samples, etc. Was anything like this ever implemented? My reading of the previous thread 
was to use a DataFrame but the "matrix API", e.g. rowSums, doesn't work with 
DataFrames (and see (4) as to whether it's even possible to store such objects in the 
assays slot).

Many thanks for your help in answering these questions. Any other suggestions 
on the design of the CoMeth class are appreciated.

Thanks,
Pete
--------------------------------
Peter Hickey,
PhD Student/Research Assistant,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Ph: +613 9345 2324

hic...@wehi.edu.au
http://www.wehi.edu.au


______________________________________________________________________
The information in this email is confidential and inte...{{dropped:16}}


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Help in designing class based on SummarizedExperiment

Reply via email to