Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Patrick Aboyoun Thu, 22 Oct 2009 11:43:06 -0700

Florence,
Below is code I would use to find the median coverage depth within specified 
ranges. This snippet includes a simplification to the coverage method that I 
just checked into svn and so will be available tomorrow at bioconductor.org via 
biocLite. Does this code meet your needs and seem reasonable to you?

suppressMessages(library(IRanges))

TSS_Dm_Anno_ch <-

+   RangedData(IRangesList(
+              "2L" = IRanges(start=c(2, 3, 5, 8), width=2),
+              "2R" = IRanges(start=c(1, 4, 6), width=3)))

polII_cov_wt_ch <-

+   RangedData(IRangesList(
+              "2L" = IRanges(start=c(1, 4, 6), width=c(3, 2, 4)),
+              "2R" = IRanges(start=c(1, 2, 3, 6), width=c(3, 3, 3, 4))),
+              score = c(2, 4, 7, 3, 1, 1, 1))

## requires IRanges >= 1.3.96
polIIcoverage <- coverage(polII_cov_wt_ch, weight = "score")
##otherwise:
##polIIcoverage <- coverage(polII_cov_wt_ch, weight = 
values(polII_cov_wt_ch)[,"score"])

cov_med <- aggregate(polIIcoverage, ranges(TSS_Dm_Anno_ch), median)

TSS_Dm_Anno_ch$med_cov <- unlist(cov_med)

TSS_Dm_Anno_ch

RangedData with 7 ranges and 1 value column on 2 sequences
colnames(1): med_cov
names(2): 2L 2R

as.data.frame(TSS_Dm_Anno_ch)

 space start end width med_cov
1    2L     2   3     2     2.0
2    2L     3   4     2     3.0
3    2L     5   6     2     5.5
4    2L     8   9     2     7.0
5    2R     1   3     3     4.0
6    2R     4   6     3     1.0
7    2R     6   8     3     1.0

sessionInfo()

R version 2.10.0 RC (2009-10-18 r50160)i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] IRanges_1.3.96





Michael Lawrence wrote:

On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <[email protected]>wrote:

Hello,

I am working with ChIP-seq data. I want to get the median of the coverage
for each gene that I have in a RangedData.
My coverage is a RangedData object as well. I extracted the ranges of my
two
objects and I used the %in% function but it gives me the ranges that are
totally or partly in a particular gene. I wish to be more restrictive and
to
include only the ranges that are entirely in the gene.

Here is an example.

class(TSS_Dm_Anno_ch)

[1] "RangedData"
attr(,"package")
[1] "IRanges"

region.TU=ranges(TSS_Dm_Anno_ch)
region.TU

CompressedIRangesList: 7 ranges
sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"

class(region.TU)

[1] "CompressedIRangesList"
attr(,"package")
[1] "IRanges"

polII_cov_wt_ch

RangedData: 6931803 ranges by 1 column on 7 sequences
colnames(1): score
names(7): 2L 2R 3L 3R 4 U X

values.ranges=ranges(polII_cov_wt_ch)
values.ranges

CompressedIRangesList: 7 ranges
sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"

## Get the borders for each gene, i,e the row numbers in values.ranges (min
and max) which correspond to ranges that are in the gene (g) range.
borders=lapply(names(values.ranges), function(ch)
t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))


And I use the following function to obtain the median of the coverage on the

ranges included in the borders

getMedianInRegionFromRangedData <- function(M.values,ranges){
 median.values=sapply(1:nrow(ranges), function(r)
median(M.values[ranges[r,1]:ranges[r,2],]$score))
 return(median.values)
}

Assuming that the RangedData represents a run-length encoded coverage
vector, the above code is taking the median of the run values, not the
median of the actual coverage. That is, as far as I can tell, you are not
weighting the median calculation by the run lengths.

Maybe that's intentional, but anyway RangedData is designed for storing
variables on a set of ranges. A value of a variable, like score, applies to
the range, not to every element of the range. In other words, RangedData is
not formally a run-length encoding, so it is not a very convenient
representation of coverage.

Coverage is more naturally represented as a sequence, e.g. an Rle object.
You could construct an RleList from the RangedData (which would be easy if
your RangedData has the zero regions). Then form an RleViews(List) using the
gene ranges and viewApply() over it.

Maybe others have better ideas..

 median.values=unlist(lapply(names(borders), function(ch)
getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))

I am sure my code should be improved, using a mapply I guess!?

I haven't found yet the appropriate function to be more restrictive.
So far, I did it using something like
min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
these borders but it's not efficient at all.
Should I use a seqselect or Find functions of the Sequence class?

One more question, is there any particular function in the chipseq package
or in an other package to help us to use the input data, taking it into
account to analyse the real signal?

Many thanks in advance,
Florence

sessionInfo()

R version 2.10.0 alpha (2009-10-04 r49926)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
[2] chipseq_0.1.27
[3] ShortRead_1.3.40
[4] lattice_0.17-26
[5] BSgenome_1.13.16
[6] Biostrings_2.13.53
[7] IRanges_1.3.94

loaded via a namespace (and not attached):
[1] Biobase_2.5.8 grid_2.10.0   hwriter_1.1   tools_2.10.0

----------------------------------------
Florence Cavalli
PhD student, Luscombe group
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
CB10 1SD
Cambridge
UK
email:[email protected] <email%[email protected]> <
email%[email protected] <email%[email protected]>>

       [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Reply via email to