Re: [Bioc-sig-seq] Comparing two chipseq position sets

Ivan Gregoretti Thu, 21 May 2009 08:19:26 -0700

Hello Hervé,

With your help and a tip from Julien Gagneur, the author of genomeIntervals
I got it to work very nicely.


So, in a nutshell, be aware that there are two objects in this package:
Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you
need the second one. (Don't expect a call to strand() to work then!!)

Also, I want to clarify that although IRanges is the king of simplicity,
this time it should be a second best option. Chipseqers need to compare
multiple chromosomes at the same time and also keep track of annotation
features. genomeIntervals was designed with that in mind.

You showed us a very nicely primer on overlapping analysis with IRanges.
That illustrates that IRanges can be used to create funtions for ranges
comparision, however, that would be like re-writing genomeIntervals.

At the end of this message, find a copy of a simple workflow to read peaks
from the MACS 1.3.5 program output and compute the number of overlapping
peaks. (Why using MACS instead of bioconductor's chipseq should be another
discussion.)

Thank you!

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878

*********************************************************************************

> require(genomeIntervals)
Loading required package: genomeIntervals
Loading required package: intervals
Loading required package: Biobase

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

>
> # read in wild type
> A <- read.table("./wildtype_peaks.xls",
+                 header=FALSE, skip=14,
+                 col.names=c("chr", "start", "end", "length", "summit",
+                             "tags", "Tpvalue", "fold_enrichment", "FDR"))
> # read in mutant
> B <- read.table("./mutant_peaks.xls",
+                 header=FALSE, skip=14,
+                 col.names=c("chr", "start", "end", "length", "summit",
+                             "tags", "Tpvalue", "fold_enrichment", "FDR"))
>
>
> # select genomic pieces of interest (no random or mitochondrial chr)
> myChromosomes <- c( 'chr1',  'chr2',  'chr3',  'chr4',  'chr5',
+                     'chr6',  'chr7',  'chr8',  'chr9', 'chr10',
+                    'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
+                    'chr16', 'chr17', 'chr18', 'chr19',
+                     'chrX',  'chrY')
> A <- A[ A$chr %in% myChromosomes, ]
> B <- B[ A$chr %in% myChromosomes, ]
>
>
> # remove unused factors
> A[] <- lapply(A, "[", drop=TRUE)
> B[] <- lapply(B, "[", drop=TRUE)
>
>
> # peek at the peaks (as data.frame)
> head(A)
   chr   start     end length summit tags Tpvalue fold_enrichment  FDR
1 chr1 3660782 3662707   1926    749   64  492.04           38.44 0.11
2 chr1 4481743 4482656    914    179   14   83.64           10.01 0.88
3 chr1 4482814 4484003   1190    671   18  108.48           16.43 0.17
4 chr1 4561321 4562262    942    472   13  136.26           30.96 0.04
5 chr1 4774888 4776304   1417    781   60  674.41           78.07 0.15
6 chr1 4797292 4798822   1531    813   56  382.46           42.73 0.06
> head(B)
   chr   start     end length summit tags Tpvalue fold_enrichment  FDR
1 chr1 3660943 3662070   1128    316   60  750.21           95.13 0.18
2 chr1 4483371 4483833    463    183   12   90.00           15.42 0.11
3 chr1 4561625 4562191    567    312   19  169.06           26.97 0.04
4 chr1 4774935 4776419   1485    734   64  849.54          197.77 0.18
5 chr1 4797256 4798908   1653    820   85  789.10          102.26 0.15
6 chr1 4847116 4848877   1762   1038   67  300.77           20.93 0.06
>
>
> # total number of peaks
> dim(A)[1]
[1] 14848
> dim(B)[1]
[1] 15668
>
>
> # peaks at FDR equal or less than 5%
> dim(A[A$FDR <= 0.05, ])[1]
[1] 6330
> dim(B[A$FDR <= 0.05, ])[1]
[1] 6665
>
>
> ## histogram of peak widths (for FDR equal or less than 5%)
> #hist(A[A$FDR <= 0.05, ]$length)
> #hist(B[B$FDR <= 0.05, ]$length)
>
> # converting the MACS output to a genomeIntervals object
> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
> giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>
> # number of peaks in giA that are overlapped by at least one peak in giB
> length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
== FALSE))
[1] 13265
> # number of peaks in giA that are NOT overlapped by any peak in giB
> length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
== TRUE))
[1] 1583

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Comparing two chipseq position sets

Reply via email to