On Thu, May 21, 2009 at 12:34 PM, Ivan Gregoretti <[email protected]>wrote:
> Hi Michael,
>
> I read rtracklayer.pdf and I got a partial victory.
> Can you show us with an example an example?
>
> For instance, in my genomeIntervals workflow I do
>
> 1)
>
> > # read in wild type
> > A <- read.table("./wildtype_peaks.xls",
> + header=FALSE, skip=14,
> + col.names=c("chr", "start", "end", "length", "summit",
> + "tags", "Tpvalue", "fold_enrichment", "FDR"))
>
> 2)
>
> > # converting the MACS output to a genomeIntervals object
> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]), closed=c(TRUE,TRUE),
> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>
Something like:
giA <- with(A, RangedData(IRanges(start, end), A[,4:9], space = chr)
> 3)
>
> > head(giA)
> Object of class Genome_intervals
> 6 base intervals and 0 inter-base intervals(*):
> chr1 [3660782, 3662707]
> chr1 [4481743, 4482656]
> chr1 [4482814, 4484003]
> chr1 [4561321, 4562262]
> chr1 [4774888, 4776304]
> chr1 [4797292, 4798822]
> annotation:
> seq_name inter_base length summit tags Tpvalue fold_enrichment FDR
> 1 chr1 FALSE 1926 749 64 492.04 38.44 0.11
> 2 chr1 FALSE 914 179 14 83.64 10.01 0.88
> 3 chr1 FALSE 1190 671 18 108.48 16.43 0.17
> 4 chr1 FALSE 942 472 13 136.26 30.96 0.04
> 5 chr1 FALSE 1417 781 60 674.41 78.07 0.15
> 6 chr1 FALSE 1531 813 56 382.46 42.73 0.06
>
>
To show the first few rows? The RangedData has a more condensed show()
method, but you can always convert to a data.frame with as.data.frame().
>
> can you show us how to do steps 2) and 3) your way?
>
> Then, with genomeIntervals I would find overlapping peaks like this:
> intervals_overlap(giaA, giB)
>
Something like:
overlap(giaA, giB)
IRanges does not have a vignette, but it does have complete man pages. Check
out help(overlap) and help(RangedData) for more details.
>
> Can you show us how to do that the non-genomeIntervals way?
>
> Thank you
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> National Institute of Diabetes and Digestive and Kidney Diseases
> National Institutes of Health
> 5 Memorial Dr, Building 5, Room 205.
> Bethesda, MD 20892. USA.
> Phone: 1-301-496-1592
> Fax: 1-301-496-9878
>
>
> On Thu, May 21, 2009 at 2:15 PM, Michael Lawrence <[email protected]>wrote:
>
>>
>>
>> 2009/5/21 Ivan Gregoretti <[email protected]>
>>
>>> Hello Hervé,
>>>
>>> With your help and a tip from Julien Gagneur, the author of
>>> genomeIntervals
>>> I got it to work very nicely.
>>>
>>> So, in a nutshell, be aware that there are two objects in this package:
>>> Genome_intervals_stranded and Genome_intervals. If you are chipseqing,
>>> you
>>> need the second one. (Don't expect a call to strand() to work then!!)
>>>
>>> Also, I want to clarify that although IRanges is the king of simplicity,
>>
>>
>> IRanges is anything but simple, and is hiding behind its "infrastructure
>> package" status to avoid providing a vignette that would advertise all of
>> its features.
>>
>> In IRanges, there is a class called RangesList that is useful for holding
>> ranges on separate chromosomes (we use the more general term 'spaces'). One
>> can call overlap() on a RangesList, and it will match up the chromosomes by
>> name and return the overlapping regions as a RangesMatchingList. For certain
>> operations there are conveniences. For example, %in% can be used to return a
>> logical vector indicating whether a range in one list overlaps one in
>> another list (in the devel version).
>>
>> As for holding extra variables, like a genomic annotation "track", this is
>> the RangedData class. One can create a RangedData directly or import one
>> from BED, GFF or WIG files with the rtracklayer package. The rtracklayer
>> vignette describes the RangedData class fairly well, even though it's
>> actually defined by IRanges.
>>
>> For example, I downloaded the repeat-masker BED track directly from UCSC
>> using rtracklayer, which yields a RangedData. I then created a RangedData of
>> chip-seq peaks using ShortRead, and coverage+slice in the IRanges package.
>> To see which peaks overlap a repeat region across an entire genome, and
>> store it as a variable:
>>
>> peaks$in.repeat <- ranges(peaks) %in% ranges(repeats)
>>
>> The IRanges developers apologize for not getting the vignette put together
>> in a timely fashion. Hopefully we will have one soon.
>>
>> Michael
>>
>>
>>>
>>> this time it should be a second best option. Chipseqers need to compare
>>> multiple chromosomes at the same time and also keep track of annotation
>>> features. genomeIntervals was designed with that in mind.
>>>
>>> You showed us a very nicely primer on overlapping analysis with IRanges.
>>> That illustrates that IRanges can be used to create funtions for ranges
>>> comparision, however, that would be like re-writing genomeIntervals.
>>>
>>> At the end of this message, find a copy of a simple workflow to read
>>> peaks
>>> from the MACS 1.3.5 program output and compute the number of overlapping
>>> peaks. (Why using MACS instead of bioconductor's chipseq should be
>>> another
>>> discussion.)
>>>
>>> Thank you!
>>>
>>> Ivan
>>>
>>> Ivan Gregoretti, PhD
>>> National Institute of Diabetes and Digestive and Kidney Diseases
>>> National Institutes of Health
>>> 5 Memorial Dr, Building 5, Room 205.
>>> Bethesda, MD 20892. USA.
>>> Phone: 1-301-496-1592
>>> Fax: 1-301-496-9878
>>>
>>>
>>> *********************************************************************************
>>>
>>> > require(genomeIntervals)
>>> Loading required package: genomeIntervals
>>> Loading required package: intervals
>>> Loading required package: Biobase
>>>
>>> Welcome to Bioconductor
>>>
>>> Vignettes contain introductory material. To view, type
>>> 'openVignette()'. To cite Bioconductor, see
>>> 'citation("Biobase")' and for packages 'citation(pkgname)'.
>>>
>>> >
>>> > # read in wild type
>>> > A <- read.table("./wildtype_peaks.xls",
>>> + header=FALSE, skip=14,
>>> + col.names=c("chr", "start", "end", "length", "summit",
>>> + "tags", "Tpvalue", "fold_enrichment",
>>> "FDR"))
>>> > # read in mutant
>>> > B <- read.table("./mutant_peaks.xls",
>>> + header=FALSE, skip=14,
>>> + col.names=c("chr", "start", "end", "length", "summit",
>>> + "tags", "Tpvalue", "fold_enrichment",
>>> "FDR"))
>>> >
>>> >
>>> > # select genomic pieces of interest (no random or mitochondrial chr)
>>> > myChromosomes <- c( 'chr1', 'chr2', 'chr3', 'chr4', 'chr5',
>>> + 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
>>> + 'chr11', 'chr12', 'chr13', 'chr14', 'chr15',
>>> + 'chr16', 'chr17', 'chr18', 'chr19',
>>> + 'chrX', 'chrY')
>>> > A <- A[ A$chr %in% myChromosomes, ]
>>> > B <- B[ A$chr %in% myChromosomes, ]
>>> >
>>> >
>>> > # remove unused factors
>>> > A[] <- lapply(A, "[", drop=TRUE)
>>> > B[] <- lapply(B, "[", drop=TRUE)
>>> >
>>> >
>>> > # peek at the peaks (as data.frame)
>>> > head(A)
>>> chr start end length summit tags Tpvalue fold_enrichment FDR
>>> 1 chr1 3660782 3662707 1926 749 64 492.04 38.44 0.11
>>> 2 chr1 4481743 4482656 914 179 14 83.64 10.01 0.88
>>> 3 chr1 4482814 4484003 1190 671 18 108.48 16.43 0.17
>>> 4 chr1 4561321 4562262 942 472 13 136.26 30.96 0.04
>>> 5 chr1 4774888 4776304 1417 781 60 674.41 78.07 0.15
>>> 6 chr1 4797292 4798822 1531 813 56 382.46 42.73 0.06
>>> > head(B)
>>> chr start end length summit tags Tpvalue fold_enrichment FDR
>>> 1 chr1 3660943 3662070 1128 316 60 750.21 95.13 0.18
>>> 2 chr1 4483371 4483833 463 183 12 90.00 15.42 0.11
>>> 3 chr1 4561625 4562191 567 312 19 169.06 26.97 0.04
>>> 4 chr1 4774935 4776419 1485 734 64 849.54 197.77 0.18
>>> 5 chr1 4797256 4798908 1653 820 85 789.10 102.26 0.15
>>> 6 chr1 4847116 4848877 1762 1038 67 300.77 20.93 0.06
>>> >
>>> >
>>> > # total number of peaks
>>> > dim(A)[1]
>>> [1] 14848
>>> > dim(B)[1]
>>> [1] 15668
>>> >
>>> >
>>> > # peaks at FDR equal or less than 5%
>>> > dim(A[A$FDR <= 0.05, ])[1]
>>> [1] 6330
>>> > dim(B[A$FDR <= 0.05, ])[1]
>>> [1] 6665
>>> >
>>> >
>>> > ## histogram of peak widths (for FDR equal or less than 5%)
>>> > #hist(A[A$FDR <= 0.05, ]$length)
>>> > #hist(B[B$FDR <= 0.05, ]$length)
>>> >
>>> > # converting the MACS output to a genomeIntervals object
>>> > giA <- new("Genome_intervals", as.matrix(A[ ,2:3]),
>>> closed=c(TRUE,TRUE),
>>> annotation=data.frame(seq_name=A[ ,1], inter_base=FALSE, A[ ,4:9]))
>>> > giB <- new("Genome_intervals", as.matrix(B[ ,2:3]),
>>> closed=c(TRUE,TRUE),
>>> annotation=data.frame(seq_name=B[ ,1], inter_base=FALSE, B[ ,4:9]))
>>> >
>>> > # number of peaks in giA that are overlapped by at least one peak in
>>> giB
>>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>>> (x[1]))
>>> == FALSE))
>>> [1] 13265
>>> > # number of peaks in giA that are NOT overlapped by any peak in giB
>>> > length(which(lapply(interval_overlap(giA, giB), function(x) is.na
>>> (x[1]))
>>> == TRUE))
>>> [1] 1583
>>>
>>> [[alternative HTML version deleted]]
>>>
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>>
>>
>
[[alternative HTML version deleted]]
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing