Re: [Bioc-sig-seq] Comparing two chipseq position sets

Michael Lawrence Thu, 21 May 2009 11:18:04 -0700

2009/5/21 Ivan Gregoretti <[email protected]>

> Hello Hervé,
>
> With your help and a tip from Julien Gagneur, the author of genomeIntervals
> I got it to work very nicely.
>
> So, in a nutshell, be aware that there are two objects in this package:
> Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you
> need the second one. (Don't expect a call to strand() to work then!!)
>
> Also, I want to clarify that although IRanges is the king of simplicity,



IRanges is anything but simple, and is hiding behind its "infrastructure
package" status to avoid providing a vignette that would advertise all of
its features.

In IRanges, there is a class called RangesList that is useful for holding
ranges on separate chromosomes (we use the more general term 'spaces'). One
can call overlap() on a RangesList, and it will match up the chromosomes by
name and return the overlapping regions as a RangesMatchingList. For certain
operations there are conveniences. For example, %in% can be used to return a
logical vector indicating whether a range in one list overlaps one in
another list (in the devel version).

As for holding extra variables, like a genomic annotation "track", this is
the RangedData class. One can create a RangedData directly or import one
from BED, GFF or WIG files with the rtracklayer package. The rtracklayer
vignette describes the RangedData class fairly well, even though it's
actually defined by IRanges.

For example, I downloaded the repeat-masker BED track directly from UCSC
using rtracklayer, which yields a RangedData. I then created a RangedData of
chip-seq peaks using ShortRead, and coverage+slice in the IRanges package.
To see which peaks overlap a repeat region across an entire genome, and
store it as a variable:

peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)

The IRanges developers apologize for not getting the vignette put together
in a timely fashion. Hopefully we will have one soon.

Michael


>
> this time it should be a second best option. Chipseqers need to compare
> multiple chromosomes at the same time and also keep track of annotation
> features. genomeIntervals was designed with that in mind.
>
> You showed us a very nicely primer on overlapping analysis with IRanges.
> That illustrates that IRanges can be used to create funtions for ranges
> comparision, however, that would be like re-writing genomeIntervals.
>
> At the end of this message, find a copy of a simple workflow to read peaks
> from the MACS 1.3.5 program output and compute the number of overlapping
> peaks. (Why using MACS instead of bioconductor's chipseq should be another
> discussion.)
>
> Thank you!
>
> Ivan
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
> *********************************************************************************
>
> > require(genomeIntervals)
> Loading required package: genomeIntervals
> Loading required package: intervals
> Loading required package: Biobase
>
> Welcome to Bioconductor
>
>  Vignettes contain introductory material. To view, type
>  'openVignette()'. To cite Bioconductor, see
>  'citation("Biobase")' and for packages 'citation(pkgname)'.
>
> >
> > # read in wild type
> > A <- read.table("./wildtype_peaks.xls",
> +                 header=FALSE, skip=14,
> +                 col.names=c("chr", "start", "end", "length", "summit",
> +                             "tags", "Tpvalue", "fold_enrichment", "FDR"))
> > # read in mutant
> > B <- read.table("./mutant_peaks.xls",
> +                 header=FALSE, skip=14,
> +                 col.names=c("chr", "start", "end", "length", "summit",
> +                             "tags", "Tpvalue", "fold_enrichment", "FDR"))
> >
> >
> > # select genomic pieces of interest (no random or mitochondrial chr)
> > myChromosomes <- c( 'chr1',  'chr2',  'chr3',  'chr4',  'chr5',
> +                     'chr6',  'chr7',  'chr8',  'chr9', 'chr10',
> +                    'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
> +                    'chr16', 'chr17', 'chr18', 'chr19',
> +                     'chrX',  'chrY')
> > A <- A[ A$chr %in% myChromosomes, ]
> > B <- B[ A$chr %in% myChromosomes, ]
> >
> >
> > # remove unused factors
> > A[] <- lapply(A, "[", drop=TRUE)
> > B[] <- lapply(B, "[", drop=TRUE)
> >
> >
> > # peek at the peaks (as data.frame)
> > head(A)
>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
> 1 chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
> 2 chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
> 3 chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
> 4 chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
> 5 chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
> 6 chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
> > head(B)
>    chr   start     end length summit tags Tpvalue fold_enrichment  FDR
> 1 chr1 3660943 3662070   1128    316   60  750.21           95.13 0.18
> 2 chr1 4483371 4483833    463    183   12   90.00           15.42 0.11
> 3 chr1 4561625 4562191    567    312   19  169.06           26.97 0.04
> 4 chr1 4774935 4776419   1485    734   64  849.54          197.77 0.18
> 5 chr1 4797256 4798908   1653    820   85  789.10          102.26 0.15
> 6 chr1 4847116 4848877   1762   1038   67  300.77           20.93 0.06
> >
> >
> > # total number of peaks
> > dim(A)[1]
> [1] 14848
> > dim(B)[1]
> [1] 15668
> >
> >
> > # peaks at FDR equal or less than 5%
> > dim(A[A$FDR <= 0.05, ])[1]
> [1] 6330
> > dim(B[A$FDR <= 0.05, ])[1]
> [1] 6665
> >
> >
> > ## histogram of peak widths (for FDR equal or less than 5%)
> > #hist(A[A$FDR <= 0.05, ]$length)
> > #hist(B[B$FDR <= 0.05, ]$length)
> >
> > # converting the MACS output to a genomeIntervals object
> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
> > giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),
> annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
> >
> > # number of peaks in giA that are overlapped by at least one peak in giB
> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
> == FALSE))
> [1] 13265
> > # number of peaks in giA that are NOT overlapped by any peak in giB
> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
> == TRUE))
> [1] 1583
>
>        [[alternative HTML version deleted]]
>
>
> _______________________________________________
> Bioc-sig-sequencing mailing list
> [email protected]
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Comparing two chipseq position sets

Reply via email to