Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Florence Cavalli Fri, 23 Oct 2009 15:04:58 -0700

Hello,

Thanks a lot for your replies. Yes this code makes lot of sense and is
actually what I need.
The uses of these special objects (RangedData, Rle ...) become more clear!


Best,
Florence

2009/10/22 Patrick Aboyoun <[email protected]>

> Florence,
> Below is code I would use to find the median coverage depth within
> specified ranges. This snippet includes a simplification to the coverage
> method that I just checked into svn and so will be available tomorrow at
> bioconductor.org via biocLite. Does this code meet your needs and seem
> reasonable to you?
>
>  suppressMessages(library(IRanges))
>>
>> TSS_Dm_Anno_ch <-
>>
> +   RangedData(IRangesList(
> +              "2L" = IRanges(start=c(2, 3, 5, 8), width=2),
> +              "2R" = IRanges(start=c(1, 4, 6), width=3)))
>
>  polII_cov_wt_ch <-
>>
> +   RangedData(IRangesList(
> +              "2L" = IRanges(start=c(1, 4, 6), width=c(3, 2, 4)),
> +              "2R" = IRanges(start=c(1, 2, 3, 6), width=c(3, 3, 3, 4))),
> +              score = c(2, 4, 7, 3, 1, 1, 1))
>
>  ## requires IRanges >= 1.3.96
>> polIIcoverage <- coverage(polII_cov_wt_ch, weight = "score")
>> ##otherwise:
>> ##polIIcoverage <- coverage(polII_cov_wt_ch, weight =
>> values(polII_cov_wt_ch)[,"score"])
>>
>
>  cov_med <- aggregate(polIIcoverage, ranges(TSS_Dm_Anno_ch), median)
>>
>
>  TSS_Dm_Anno_ch$med_cov <- unlist(cov_med)
>>
>
>  TSS_Dm_Anno_ch
>>
> RangedData with 7 ranges and 1 value column on 2 sequences
> colnames(1): med_cov
> names(2): 2L 2R
>
>  as.data.frame(TSS_Dm_Anno_ch)
>>
>  space start end width med_cov
> 1    2L     2   3     2     2.0
> 2    2L     3   4     2     3.0
> 3    2L     5   6     2     5.5
> 4    2L     8   9     2     7.0
> 5    2R     1   3     3     4.0
> 6    2R     4   6     3     1.0
> 7    2R     6   8     3     1.0
>
>  sessionInfo()
>>
> R version 2.10.0 RC (2009-10-18 r50160) i386-apple-darwin9.8.0
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] IRanges_1.3.96
>
>
>
>
>
> Michael Lawrence wrote:
>
>> On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <[email protected]
>> >wrote:
>>
>>
>>
>>> Hello,
>>>
>>> I am working with ChIP-seq data. I want to get the median of the coverage
>>> for each gene that I have in a RangedData.
>>> My coverage is a RangedData object as well. I extracted the ranges of my
>>> two
>>> objects and I used the %in% function but it gives me the ranges that are
>>> totally or partly in a particular gene. I wish to be more restrictive and
>>> to
>>> include only the ranges that are entirely in the gene.
>>>
>>> Here is an example.
>>>
>>>
>>>
>>>> class(TSS_Dm_Anno_ch)
>>>>
>>>>
>>> [1] "RangedData"
>>> attr(,"package")
>>> [1] "IRanges"
>>>
>>>
>>>> region.TU=ranges(TSS_Dm_Anno_ch)
>>>> region.TU
>>>>
>>>>
>>> CompressedIRangesList: 7 ranges
>>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>>
>>>
>>>> class(region.TU)
>>>>
>>>>
>>> [1] "CompressedIRangesList"
>>> attr(,"package")
>>> [1] "IRanges"
>>>
>>>
>>>
>>>> polII_cov_wt_ch
>>>>
>>>>
>>> RangedData: 6931803 ranges by 1 column on 7 sequences
>>> colnames(1): score
>>> names(7): 2L 2R 3L 3R 4 U X
>>>
>>>
>>>> values.ranges=ranges(polII_cov_wt_ch)
>>>> values.ranges
>>>>
>>>>
>>> CompressedIRangesList: 7 ranges
>>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X"
>>>
>>> ## Get the borders for each gene, i,e the row numbers in values.ranges
>>> (min
>>> and max) which correspond to ranges that are in the gene (g) range.
>>> borders=lapply(names(values.ranges), function(ch)
>>> t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g)
>>> range(which(values.ranges[[ch]] %in% region.TU[[ch]][g])))))
>>>
>>>
>>>
>>
>> And I use the following function to obtain the median of the coverage on
>> the
>>
>>
>>> ranges included in the borders
>>>
>>> getMedianInRegionFromRangedData <- function(M.values,ranges){
>>>  median.values=sapply(1:nrow(ranges), function(r)
>>> median(M.values[ranges[r,1]:ranges[r,2],]$score))
>>>  return(median.values)
>>> }
>>>
>>>
>>>
>>>
>> Assuming that the RangedData represents a run-length encoded coverage
>> vector, the above code is taking the median of the run values, not the
>> median of the actual coverage. That is, as far as I can tell, you are not
>> weighting the median calculation by the run lengths.
>>
>> Maybe that's intentional, but anyway RangedData is designed for storing
>> variables on a set of ranges. A value of a variable, like score, applies
>> to
>> the range, not to every element of the range. In other words, RangedData
>> is
>> not formally a run-length encoding, so it is not a very convenient
>> representation of coverage.
>>
>> Coverage is more naturally represented as a sequence, e.g. an Rle object.
>> You could construct an RleList from the RangedData (which would be easy if
>> your RangedData has the zero regions). Then form an RleViews(List) using
>> the
>> gene ranges and viewApply() over it.
>>
>> Maybe others have better ideas..
>>
>>
>>
>>
>>>  median.values=unlist(lapply(names(borders), function(ch)
>>> getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]])))
>>>
>>> I am sure my code should be improved, using a mapply I guess!?
>>>
>>> I haven't found yet the appropriate function to be more restrictive.
>>> So far, I did it using something like
>>> min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and
>>> max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get
>>> these borders but it's not efficient at all.
>>> Should I use a seqselect or Find functions of the Sequence class?
>>>
>>> One more question, is there any particular function in the chipseq
>>> package
>>> or in an other package to help us to use the input data, taking it into
>>> account to analyse the real signal?
>>>
>>> Many thanks in advance,
>>> Florence
>>>
>>>
>>>
>>>> sessionInfo()
>>>>
>>>>
>>> R version 2.10.0 alpha (2009-10-04 r49926)
>>> x86_64-unknown-linux-gnu
>>>
>>> locale:
>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>  [5] LC_MONETARY=C              LC_MESSAGES=en_GB.UTF-8
>>>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>>> [2] chipseq_0.1.27
>>> [3] ShortRead_1.3.40
>>> [4] lattice_0.17-26
>>> [5] BSgenome_1.13.16
>>> [6] Biostrings_2.13.53
>>> [7] IRanges_1.3.94
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.5.8 grid_2.10.0   hwriter_1.1   tools_2.10.0
>>>
>>> ----------------------------------------
>>> Florence Cavalli
>>> PhD student, Luscombe group
>>> EMBL-European Bioinformatics Institute
>>> Wellcome Trust Genome Campus
>>> Hinxton
>>> CB10 1SD
>>> Cambridge
>>> UK
>>> email:[email protected] <email%[email protected]> <
>>> email%[email protected] <email%[email protected]>> <
>>> email%[email protected] <email%[email protected]> <
>>> email%[email protected] <email%[email protected]>>>
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-sig-sequencing mailing list
>> [email protected]
>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>
>>
>
>


-- 
----------------------------------------
Florence Cavalli
PhD student, Luscombe group
EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton
CB10 1SD
Cambridge
UK
email:[email protected] <email%[email protected]>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] coverage ranges included in gene ranges and input data

Reply via email to