[Bioc-sig-seq] coverage ranges included in gene ranges and input data

Florence Cavalli Thu, 22 Oct 2009 03:39:15 -0700

Hello,

I am working with ChIP-seq data. I want to get the median of the coverage
for each gene that I have in a RangedData.
My coverage is a RangedData object as well. I extracted the ranges of my two
objects and I used the %in% function but it gives me the ranges that are
totally or partly in a particular gene. I wish to be more restrictive and to
include only the ranges that are entirely in the gene.


Here is an example.

> class(TSS_Dm_Anno_ch)
[1] "RangedData"
attr(,"package")
[1] "IRanges"
> region.TU=ranges(TSS_Dm_Anno_ch)
> region.TU
CompressedIRangesList: 7 ranges
sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
> class(region.TU)
[1] "CompressedIRangesList"
attr(,"package")
[1] "IRanges"

> polII_cov_wt_ch
RangedData: 6931803 ranges by 1 column on 7 sequences
colnames(1): score
names(7): 2L 2R 3L 3R 4 U X
> values.ranges=ranges(polII_cov_wt_ch)
> values.ranges
CompressedIRangesList: 7 ranges
sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"

## Get the borders for each gene, i,e the row numbers in values.ranges (min
and max) which correspond to ranges that are in the gene (g) range.
borders=lapply(names(values.ranges), function(ch)
t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))

And I use the following function to obtain the median of the coverage on the
ranges included in the borders

getMedianInRegionFromRangedData <- function(M.values,ranges){
  median.values=sapply(1:nrow(ranges), function(r)
median(M.values[ranges[r,1]:ranges[r,2],]$score))
  return(median.values)
}

 median.values=unlist(lapply(names(borders), function(ch)
getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))

I am sure my code should be improved, using a mapply I guess!?

I haven't found yet the appropriate function to be more restrictive.
So far, I did it using something like
min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
these borders but it's not efficient at all.
Should I use a seqselect or Find functions of the Sequence class?

One more question, is there any particular function in the chipseq package
or in an other package to help us to use the input data, taking it into
account to analyse the real signal?

Many thanks in advance,
Florence

> sessionInfo()
R version 2.10.0 alpha (2009-10-04 r49926)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
[2] chipseq_0.1.27
[3] ShortRead_1.3.40
[4] lattice_0.17-26
[5] BSgenome_1.13.16
[6] Biostrings_2.13.53
[7] IRanges_1.3.94

loaded via a namespace (and not attached):
[1] Biobase_2.5.8 grid_2.10.0   hwriter_1.1   tools_2.10.0

----------------------------------------
Florence Cavalli
PhD student, Luscombe group
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
CB10 1SD
Cambridge
UK
email:[email protected] <email%[email protected]>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

[Bioc-sig-seq] coverage ranges included in gene ranges and input data

Reply via email to