Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Michael Lawrence Thu, 22 Oct 2009 08:12:48 -0700

On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <[email protected]>wrote:


> Hello,
>
> I am working with ChIP-seq data. I want to get the median of the coverage
> for each gene that I have in a RangedData.
> My coverage is a RangedData object as well. I extracted the ranges of my
> two
> objects and I used the %in% function but it gives me the ranges that are
> totally or partly in a particular gene. I wish to be more restrictive and
> to
> include only the ranges that are entirely in the gene.
>
> Here is an example.
>
> > class(TSS_Dm_Anno_ch)
> [1] "RangedData"
> attr(,"package")
> [1] "IRanges"
> > region.TU=ranges(TSS_Dm_Anno_ch)
> > region.TU
> CompressedIRangesList: 7 ranges
> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
> > class(region.TU)
> [1] "CompressedIRangesList"
> attr(,"package")
> [1] "IRanges"
>
> > polII_cov_wt_ch
> RangedData: 6931803 ranges by 1 column on 7 sequences
> colnames(1): score
> names(7): 2L 2R 3L 3R 4 U X
> > values.ranges=ranges(polII_cov_wt_ch)
> > values.ranges
> CompressedIRangesList: 7 ranges
> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>
> ## Get the borders for each gene, i,e the row numbers in values.ranges (min
> and max) which correspond to ranges that are in the gene (g) range.
> borders=lapply(names(values.ranges), function(ch)
> t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
> range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))
>

And I use the following function to obtain the median of the coverage on the
> ranges included in the borders
>
> getMedianInRegionFromRangedData <- function(M.values,ranges){
>  median.values=sapply(1:nrow(ranges), function(r)
> median(M.values[ranges[r,1]:ranges[r,2],]$score))
>  return(median.values)
> }
>
>
Assuming that the RangedData represents a run-length encoded coverage
vector, the above code is taking the median of the run values, not the
median of the actual coverage. That is, as far as I can tell, you are not
weighting the median calculation by the run lengths.

Maybe that's intentional, but anyway RangedData is designed for storing
variables on a set of ranges. A value of a variable, like score, applies to
the range, not to every element of the range. In other words, RangedData is
not formally a run-length encoding, so it is not a very convenient
representation of coverage.

Coverage is more naturally represented as a sequence, e.g. an Rle object.
You could construct an RleList from the RangedData (which would be easy if
your RangedData has the zero regions). Then form an RleViews(List) using the
gene ranges and viewApply() over it.

Maybe others have better ideas..


>  median.values=unlist(lapply(names(borders), function(ch)
> getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))
>
> I am sure my code should be improved, using a mapply I guess!?
>
> I haven't found yet the appropriate function to be more restrictive.
> So far, I did it using something like
> min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
> max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
> these borders but it's not efficient at all.
> Should I use a seqselect or Find functions of the Sequence class?
>
> One more question, is there any particular function in the chipseq package
> or in an other package to help us to use the input data, taking it into
> account to analyse the real signal?
>
> Many thanks in advance,
> Florence
>
> > sessionInfo()
> R version 2.10.0 alpha (2009-10-04 r49926)
> x86_64-unknown-linux-gnu
>
> locale:
>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
> [2] chipseq_0.1.27
> [3] ShortRead_1.3.40
> [4] lattice_0.17-26
> [5] BSgenome_1.13.16
> [6] Biostrings_2.13.53
> [7] IRanges_1.3.94
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.5.8 grid_2.10.0   hwriter_1.1   tools_2.10.0
>
> ----------------------------------------
> Florence Cavalli
> PhD student, Luscombe group
> EMBL-European Bioinformatics Institute
> Wellcome Trust Genome Campus
> Hinxton
> CB10 1SD
> Cambridge
> UK
> email:[email protected] <email%[email protected]> <
> email%[email protected] <email%[email protected]>>
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Reply via email to