Re: [Bioc-sig-seq] Comparing two chipseq position sets

Michael Lawrence Thu, 21 May 2009 14:38:21 -0700

On Thu, May 21, 2009 at 12:34 PM, Ivan Gregoretti <[email protected]>wrote:


> Hi Michael,
>
> I read rtracklayer.pdf and I got a partial victory.
> Can you show us with an example an example?
>
> For instance, in my genomeIntervals workflow I do
>
> 1)
>
> > # read in wild type
> > A <- read.table("./wildtype_peaks.xls",
> +                 header=FALSE, skip=14,
> +                 col.names=c("chr", "start", "end", "length", "summit",
> +                             "tags", "Tpvalue", "fold_enrichment", "FDR"))
>
> 2)
>
> > # converting the MACS output to a genomeIntervals object
> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>

Something like:
giA <- with(A, RangedData(IRanges(start, end), A[,4:9], space = chr)


> 3)
>
> > head(giA)
> Object of class Genome_intervals
> 6 base intervals and 0 inter-base intervals(*):
> chr1 [3660782, 3662707]
> chr1 [4481743, 4482656]
> chr1 [4482814, 4484003]
> chr1 [4561321, 4562262]
> chr1 [4774888, 4776304]
> chr1 [4797292, 4798822]
> annotation:
>   seq_name inter_base length summit tags Tpvalue fold_enrichment  FDR
> 1     chr1      FALSE   1926    749   64  492.04           38.44 0.11
> 2     chr1      FALSE    914    179   14   83.64           10.01 0.88
> 3     chr1      FALSE   1190    671   18  108.48           16.43 0.17
> 4     chr1      FALSE    942    472   13  136.26           30.96 0.04
> 5     chr1      FALSE   1417    781   60  674.41           78.07 0.15
> 6     chr1      FALSE   1531    813   56  382.46           42.73 0.06
>
>
To show the first few rows? The RangedData has a more condensed show()
method, but you can always convert to a data.frame with as.data.frame().


>
> can you show us how to do steps 2) and 3) your way?
>
> Then, with genomeIntervals I would find overlapping peaks like this:
> intervals_overlap(giaA, giB)
>

Something like:
overlap(giaA, giB)

IRanges does not have a vignette, but it does have complete man pages. Check
out help(overlap) and help(RangedData) for more details.


>
> Can you show us how to do that the non-genomeIntervals way?
>
> Thank you
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
> On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <[email protected]>wrote:
>
>>
>>
>> 2009/5/21 Ivan Gregoretti <[email protected]>
>>
>>> Hello Hervé,
>>>
>>> With your help and a tip from Julien Gagneur, the author of
>>> genomeIntervals
>>> I got it to work very nicely.
>>>
>>> So, in a nutshell, be aware that there are two objects in this package:
>>> Genome_intervals_stranded and Genome_intervals. If you are chipseqing,
>>> you
>>> need the second one. (Don't expect a call to strand() to work then!!)
>>>
>>> Also, I want to clarify that although IRanges is the king of simplicity,
>>
>>
>> IRanges is anything but simple, and is hiding behind its "infrastructure
>> package" status to avoid providing a vignette that would advertise all of
>> its features.
>>
>> In IRanges, there is a class called RangesList that is useful for holding
>> ranges on separate chromosomes (we use the more general term 'spaces'). One
>> can call overlap() on a RangesList, and it will match up the chromosomes by
>> name and return the overlapping regions as a RangesMatchingList. For certain
>> operations there are conveniences. For example, %in% can be used to return a
>> logical vector indicating whether a range in one list overlaps one in
>> another list (in the devel version).
>>
>> As for holding extra variables, like a genomic annotation "track", this is
>> the RangedData class. One can create a RangedData directly or import one
>> from BED, GFF or WIG files with the rtracklayer package. The rtracklayer
>> vignette describes the RangedData class fairly well, even though it's
>> actually defined by IRanges.
>>
>> For example, I downloaded the repeat-masker BED track directly from UCSC
>> using rtracklayer, which yields a RangedData. I then created a RangedData of
>> chip-seq peaks using ShortRead, and coverage+slice in the IRanges package.
>> To see which peaks overlap a repeat region across an entire genome, and
>> store it as a variable:
>>
>> peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)
>>
>> The IRanges developers apologize for not getting the vignette put together
>> in a timely fashion. Hopefully we will have one soon.
>>
>> Michael
>>
>>
>>>
>>> this time it should be a second best option. Chipseqers need to compare
>>> multiple chromosomes at the same time and also keep track of annotation
>>> features. genomeIntervals was designed with that in mind.
>>>
>>> You showed us a very nicely primer on overlapping analysis with IRanges.
>>> That illustrates that IRanges can be used to create funtions for ranges
>>> comparision, however, that would be like re-writing genomeIntervals.
>>>
>>> At the end of this message, find a copy of a simple workflow to read
>>> peaks
>>> from the MACS 1.3.5 program output and compute the number of overlapping
>>> peaks. (Why using MACS instead of bioconductor's chipseq should be
>>> another
>>> discussion.)
>>>
>>> Thank you!
>>>
>>> Ivan
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>> *********************************************************************************
>>>
>>> > require(genomeIntervals)
>>> Loading required package: genomeIntervals
>>> Loading required package: intervals
>>> Loading required package: Biobase
>>>
>>> Welcome to Bioconductor
>>>
>>>  Vignettes contain introductory material. To view, type
>>>  'openVignette()'. To cite Bioconductor, see
>>>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>
>>> >
>>> > # read in wild type
>>> > A <- read.table("./wildtype_peaks.xls",
>>> +                 header=FALSE, skip=14,
>>> +                 col.names=c("chr", "start", "end", "length", "summit",
>>> +                             "tags", "Tpvalue", "fold_enrichment",
>>> "FDR"))
>>> > # read in mutant
>>> > B <- read.table("./mutant_peaks.xls",
>>> +                 header=FALSE, skip=14,
>>> +                 col.names=c("chr", "start", "end", "length", "summit",
>>> +                             "tags", "Tpvalue", "fold_enrichment",
>>> "FDR"))
>>> >
>>> >
>>> > # select genomic pieces of interest (no random or mitochondrial chr)
>>> > myChromosomes <- c( 'chr1',  'chr2',  'chr3',  'chr4',  'chr5',
>>> +                     'chr6',  'chr7',  'chr8',  'chr9', 'chr10',
>>> +                    'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
>>> +                    'chr16', 'chr17', 'chr18', 'chr19',
>>> +                     'chrX',  'chrY')
>>> > A <- A[ A$chr %in% myChromosomes, ]
>>> > B <- B[ A$chr %in% myChromosomes, ]
>>> >
>>> >
>>> > # remove unused factors
>>> > A[] <- lapply(A, "[", drop=TRUE)
>>> > B[] <- lapply(B, "[", drop=TRUE)
>>> >
>>> >
>>> > # peek at the peaks (as data.frame)
>>> > head(A)
>>>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
>>> 1 chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
>>> 2 chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
>>> 3 chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
>>> 4 chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
>>> 5 chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
>>> 6 chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
>>> > head(B)
>>>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
>>> 1 chr1 3660943 3662070   1128    316   60  750.21           95.13 0.18
>>> 2 chr1 4483371 4483833    463    183   12   90.00           15.42 0.11
>>> 3 chr1 4561625 4562191    567    312   19  169.06           26.97 0.04
>>> 4 chr1 4774935 4776419   1485    734   64  849.54          197.77 0.18
>>> 5 chr1 4797256 4798908   1653    820   85  789.10          102.26 0.15
>>> 6 chr1 4847116 4848877   1762   1038   67  300.77           20.93 0.06
>>> >
>>> >
>>> > # total number of peaks
>>> > dim(A)[1]
>>> [1] 14848
>>> > dim(B)[1]
>>> [1] 15668
>>> >
>>> >
>>> > # peaks at FDR equal or less than 5%
>>> > dim(A[A$FDR <= 0.05, ])[1]
>>> [1] 6330
>>> > dim(B[A$FDR <= 0.05, ])[1]
>>> [1] 6665
>>> >
>>> >
>>> > ## histogram of peak widths (for FDR equal or less than 5%)
>>> > #hist(A[A$FDR <= 0.05, ]$length)
>>> > #hist(B[B$FDR <= 0.05, ]$length)
>>> >
>>> > # converting the MACS output to a genomeIntervals object
>>> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]),
>>> closed=c(TRUE,TRUE),
>>> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>>> > giB <- new("Genome_intervals", as.matrix(B[ ,2:3]),
>>> closed=c(TRUE,TRUE),
>>> annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>>> >
>>> > # number of peaks in giA that are overlapped by at least one peak in
>>> giB
>>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>>> (x[1]))
>>> == FALSE))
>>> [1] 13265
>>> > # number of peaks in giA that are NOT overlapped by any peak in giB
>>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>>> (x[1]))
>>> == TRUE))
>>> [1] 1583
>>>
>>>        [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Comparing two chipseq position sets

Reply via email to