Hello Hervé,
With your help and a tip from Julien Gagneur, the author of genomeIntervals
I got it to work very nicely.
So, in a nutshell, be aware that there are two objects in this package:
Genome_intervals_stranded and Genome_intervals. If you are chipseqing, you
need the second one. (Don't expect a call to strand() to work then!!)
Also, I want to clarify that although IRanges is the king of simplicity,
this time it should be a second best option. Chipseqers need to compare
multiple chromosomes at the same time and also keep track of annotation
features. genomeIntervals was designed with that in mind.
You showed us a very nicely primer on overlapping analysis with IRanges.
That illustrates that IRanges can be used to create funtions for ranges
comparision, however, that would be like re-writing genomeIntervals.
At the end of this message, find a copy of a simple workflow to read peaks
from the MACS 1.3.5 program output and compute the number of overlapping
peaks. (Why using MACS instead of bioconductor's chipseq should be another
discussion.)
Thank you!
Ivan
Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1592
Fax: 1-301-496-9878
*********************************************************************************
> require(genomeIntervals)
Loading required package: genomeIntervals
Loading required package: intervals
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material. To view, type
'openVignette()'. To cite Bioconductor, see
'citation("Biobase")' and for packages 'citation(pkgname)'.
>
> # read in wild type
> A <- read.table("./wildtype_peaks.xls",
+ header=FALSE, skip=14,
+ col.names=c("chr", "start", "end", "length", "summit",
+ "tags", "Tpvalue", "fold_enrichment", "FDR"))
> # read in mutant
> B <- read.table("./mutant_peaks.xls",
+ header=FALSE, skip=14,
+ col.names=c("chr", "start", "end", "length", "summit",
+ "tags", "Tpvalue", "fold_enrichment", "FDR"))
>
>
> # select genomic pieces of interest (no random or mitochondrial chr)
> myChromosomes <- c( 'chr1', 'chr2', 'chr3', 'chr4', 'chr5',
+ 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
+ 'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
+ 'chr16', 'chr17', 'chr18', 'chr19',
+ 'chrX', 'chrY')
> A <- A[ A$chr %in% myChromosomes, ]
> B <- B[ A$chr %in% myChromosomes, ]
>
>
> # remove unused factors
> A[] <- lapply(A, "[", drop=TRUE)
> B[] <- lapply(B, "[", drop=TRUE)
>
>
> # peek at the peaks (as data.frame)
> head(A)
chr start end length summit tags Tpvalue fold_enrichment FDR
1 chr1 3660782 3662707 1926 749 64 492.04 38.44 0.11
2 chr1 4481743 4482656 914 179 14 83.64 10.01 0.88
3 chr1 4482814 4484003 1190 671 18 108.48 16.43 0.17
4 chr1 4561321 4562262 942 472 13 136.26 30.96 0.04
5 chr1 4774888 4776304 1417 781 60 674.41 78.07 0.15
6 chr1 4797292 4798822 1531 813 56 382.46 42.73 0.06
> head(B)
chr start end length summit tags Tpvalue fold_enrichment FDR
1 chr1 3660943 3662070 1128 316 60 750.21 95.13 0.18
2 chr1 4483371 4483833 463 183 12 90.00 15.42 0.11
3 chr1 4561625 4562191 567 312 19 169.06 26.97 0.04
4 chr1 4774935 4776419 1485 734 64 849.54 197.77 0.18
5 chr1 4797256 4798908 1653 820 85 789.10 102.26 0.15
6 chr1 4847116 4848877 1762 1038 67 300.77 20.93 0.06
>
>
> # total number of peaks
> dim(A)[1]
[1] 14848
> dim(B)[1]
[1] 15668
>
>
> # peaks at FDR equal or less than 5%
> dim(A[A$FDR <= 0.05, ])[1]
[1] 6330
> dim(B[A$FDR <= 0.05, ])[1]
[1] 6665
>
>
> ## histogram of peak widths (for FDR equal or less than 5%)
> #hist(A[A$FDR <= 0.05, ]$length)
> #hist(B[B$FDR <= 0.05, ]$length)
>
> # converting the MACS output to a genomeIntervals object
> giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
> giB <- new("Genome_intervals", as.matrix(B[ ,2:3]), closed=c(TRUE,TRUE),
annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>
> # number of peaks in giA that are overlapped by at least one peak in giB
> length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
== FALSE))
[1] 13265
> # number of peaks in giA that are NOT overlapped by any peak in giB
> length(which(lapply(interval_overlap(giA, giB), function(x) is.na(x[1]))
== TRUE))
[1] 1583
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing