Hello, Thanks a lot for your replies. Yes this code makes lot of sense and is actually what I need. The uses of these special objects (RangedData, Rle ...) become more clear!
Best, Florence 2009/10/22 Patrick Aboyoun <[email protected]> > Florence, > Below is code I would use to find the median coverage depth within > specified ranges. This snippet includes a simplification to the coverage > method that I just checked into svn and so will be available tomorrow at > bioconductor.org via biocLite. Does this code meet your needs and seem > reasonable to you? > > suppressMessages(library(IRanges)) >> >> TSS_Dm_Anno_ch <- >> > + RangedData(IRangesList( > + "2L" = IRanges(start=c(2, 3, 5, 8), width=2), > + "2R" = IRanges(start=c(1, 4, 6), width=3))) > > polII_cov_wt_ch <- >> > + RangedData(IRangesList( > + "2L" = IRanges(start=c(1, 4, 6), width=c(3, 2, 4)), > + "2R" = IRanges(start=c(1, 2, 3, 6), width=c(3, 3, 3, 4))), > + score = c(2, 4, 7, 3, 1, 1, 1)) > > ## requires IRanges >= 1.3.96 >> polIIcoverage <- coverage(polII_cov_wt_ch, weight = "score") >> ##otherwise: >> ##polIIcoverage <- coverage(polII_cov_wt_ch, weight = >> values(polII_cov_wt_ch)[,"score"]) >> > > cov_med <- aggregate(polIIcoverage, ranges(TSS_Dm_Anno_ch), median) >> > > TSS_Dm_Anno_ch$med_cov <- unlist(cov_med) >> > > TSS_Dm_Anno_ch >> > RangedData with 7 ranges and 1 value column on 2 sequences > colnames(1): med_cov > names(2): 2L 2R > > as.data.frame(TSS_Dm_Anno_ch) >> > space start end width med_cov > 1 2L 2 3 2 2.0 > 2 2L 3 4 2 3.0 > 3 2L 5 6 2 5.5 > 4 2L 8 9 2 7.0 > 5 2R 1 3 3 4.0 > 6 2R 4 6 3 1.0 > 7 2R 6 8 3 1.0 > > sessionInfo() >> > R version 2.10.0 RC (2009-10-18 r50160) i386-apple-darwin9.8.0 > locale: > [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] IRanges_1.3.96 > > > > > > Michael Lawrence wrote: > >> On Thu, Oct 22, 2009 at 3:38 AM, Florence Cavalli <[email protected] >> >wrote: >> >> >> >>> Hello, >>> >>> I am working with ChIP-seq data. I want to get the median of the coverage >>> for each gene that I have in a RangedData. >>> My coverage is a RangedData object as well. I extracted the ranges of my >>> two >>> objects and I used the %in% function but it gives me the ranges that are >>> totally or partly in a particular gene. I wish to be more restrictive and >>> to >>> include only the ranges that are entirely in the gene. >>> >>> Here is an example. >>> >>> >>> >>>> class(TSS_Dm_Anno_ch) >>>> >>>> >>> [1] "RangedData" >>> attr(,"package") >>> [1] "IRanges" >>> >>> >>>> region.TU=ranges(TSS_Dm_Anno_ch) >>>> region.TU >>>> >>>> >>> CompressedIRangesList: 7 ranges >>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X" >>> >>> >>>> class(region.TU) >>>> >>>> >>> [1] "CompressedIRangesList" >>> attr(,"package") >>> [1] "IRanges" >>> >>> >>> >>>> polII_cov_wt_ch >>>> >>>> >>> RangedData: 6931803 ranges by 1 column on 7 sequences >>> colnames(1): score >>> names(7): 2L 2R 3L 3R 4 U X >>> >>> >>>> values.ranges=ranges(polII_cov_wt_ch) >>>> values.ranges >>>> >>>> >>> CompressedIRangesList: 7 ranges >>> sequences(7): "2L" "2R" "3L" "3R" "4" "U" "X" >>> >>> ## Get the borders for each gene, i,e the row numbers in values.ranges >>> (min >>> and max) which correspond to ranges that are in the gene (g) range. >>> borders=lapply(names(values.ranges), function(ch) >>> t(sapply(1:nrow(as.data.frame(region.TU[[ch]])),function(g) >>> range(which(values.ranges[[ch]] %in% region.TU[[ch]][g]))))) >>> >>> >>> >> >> And I use the following function to obtain the median of the coverage on >> the >> >> >>> ranges included in the borders >>> >>> getMedianInRegionFromRangedData <- function(M.values,ranges){ >>> median.values=sapply(1:nrow(ranges), function(r) >>> median(M.values[ranges[r,1]:ranges[r,2],]$score)) >>> return(median.values) >>> } >>> >>> >>> >>> >> Assuming that the RangedData represents a run-length encoded coverage >> vector, the above code is taking the median of the run values, not the >> median of the actual coverage. That is, as far as I can tell, you are not >> weighting the median calculation by the run lengths. >> >> Maybe that's intentional, but anyway RangedData is designed for storing >> variables on a set of ranges. A value of a variable, like score, applies >> to >> the range, not to every element of the range. In other words, RangedData >> is >> not formally a run-length encoding, so it is not a very convenient >> representation of coverage. >> >> Coverage is more naturally represented as a sequence, e.g. an Rle object. >> You could construct an RleList from the RangedData (which would be easy if >> your RangedData has the zero regions). Then form an RleViews(List) using >> the >> gene ranges and viewApply() over it. >> >> Maybe others have better ideas.. >> >> >> >> >>> median.values=unlist(lapply(names(borders), function(ch) >>> getMedianInRegionFromRangedData(polII_cov_wt_ch[ch],borders[[ch]]))) >>> >>> I am sure my code should be improved, using a mapply I guess!? >>> >>> I haven't found yet the appropriate function to be more restrictive. >>> So far, I did it using something like >>> min(which(start(values.ranges[[ch]])>start(region.TU[[ch]])[g])) and >>> max(which(start(values.ranges[[ch]])<end(region.TU[[ch]])[g]))-1 to get >>> these borders but it's not efficient at all. >>> Should I use a seqselect or Find functions of the Sequence class? >>> >>> One more question, is there any particular function in the chipseq >>> package >>> or in an other package to help us to use the input data, taking it into >>> account to analyse the real signal? >>> >>> Many thanks in advance, >>> Florence >>> >>> >>> >>>> sessionInfo() >>>> >>>> >>> R version 2.10.0 alpha (2009-10-04 r49926) >>> x86_64-unknown-linux-gnu >>> >>> locale: >>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 >>> [5] LC_MONETARY=C LC_MESSAGES=en_GB.UTF-8 >>> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11 >>> [2] chipseq_0.1.27 >>> [3] ShortRead_1.3.40 >>> [4] lattice_0.17-26 >>> [5] BSgenome_1.13.16 >>> [6] Biostrings_2.13.53 >>> [7] IRanges_1.3.94 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.5.8 grid_2.10.0 hwriter_1.1 tools_2.10.0 >>> >>> ---------------------------------------- >>> Florence Cavalli >>> PhD student, Luscombe group >>> EMBL-European Bioinformatics Institute >>> Wellcome Trust Genome Campus >>> Hinxton >>> CB10 1SD >>> Cambridge >>> UK >>> email:[email protected] <email%[email protected]> < >>> email%[email protected] <email%[email protected]>> < >>> email%[email protected] <email%[email protected]> < >>> email%[email protected] <email%[email protected]>>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioc-sig-sequencing mailing list >>> [email protected] >>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioc-sig-sequencing mailing list >> [email protected] >> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing >> >> > > -- ---------------------------------------- Florence Cavalli PhD student, Luscombe group EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton CB10 1SD Cambridge UK email:[email protected] <email%[email protected]> [[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
