Hi Elizabeth,
AFAIK we don't have anything for detecting this kind of "compatibility"
between a gapped read and the exons in a transcript.
Here is a function that detects whether a gapped read is "compatible"
with the exons of a given transcript:
## 'gread' must be a GRanges object of length >= 2 representing the
## gapped read.
## 'exons' must be a GRanges object of length >= 2 representing the
## exons of a given transcript assumed to be already ordered by rank.
## Returns an integer vector of length 2:
## - If the ranges in 'gread' are compatible with 'exons', then
## returns the ranks of the first and last exons spanned by 'gread';
## - Otherwise returns 2 NAs.
gappedReadAndExonsCompatibility <- function(gread, exons)
{
seqname1 <- unique(seqnames(gread))
strand1 <- unique(strand(gread))
seqname2 <- unique(seqnames(exons))
strand2 <- unique(strand(exons))
if (length(seqname1) != 1L || length(strand1) != 1L
|| length(seqname2) != 1L || length(strand2) != 1L)
stop("trans-splicing is not supported, sorry")
if (strand1 == "*" || strand2 == "*")
stop("strand * is not supported (must be + or -)")
## 'gread' and 'exons' are not on the same chromosome/strand:
if (seqname1 != seqname2 || strand1 != strand2)
return(c(NA_integer_, NA_integer_))
if (length(gread) < 2L || length(exons) < 2L)
stop("'length(gread)' and 'length(exons)' must be >= 2")
if (strand1 == "-")
exons <- rev(exons)
## Determine rank of first exon that fits:
first_exon <- which(start(gread)[1L] >= start(exons) &
end(gread)[1L] == end(exons))
## Determine rank of last exon that fits
last_exon <- which(start(gread)[length(gread)] == start(exons) &
end(gread)[length(gread)] <= end(exons))
if (length(first_exon) > 1L)
stop("unsupported splicing: ",
"first range in 'gread' fits more than 1 exon")
if (length(last_exon) > 1L)
stop("unsupported splicing: ",
"last range in 'gread' fits more than 1 exon")
if (length(first_exon) == 0L || is.na(first_exon)
|| length(last_exon) == 0L || is.na(last_exon)
|| length(gread) != last_exon - first_exon + 1L)
return(c(NA_integer_, NA_integer_))
## Check exons between first and last exons:
middle_ranges <- seq_len(length(gread) - 2L) + 1L
middle_exons <- seq_len(length(gread) - 2L) + first_exon
if (any(start(gread)[middle_ranges] != start(exons)[middle_exons])
|| any(end(gread)[middle_ranges] != end(exons)[middle_exons]))
return(c(NA_integer_, NA_integer_))
ans <- c(first_exon, last_exon)
if (strand1 == "-")
ans <- length(exons) - rev(ans) + 1L
ans
}
Then with:
exons <- GRanges(
seqnames=rep("chr1", 6),
ranges=IRanges(
start=c(601,1001,1401,2001,2501,3001),
end=c(700,1090,1600,2100,2800,3050)),
strand="+"
)
gread <- GRanges(
seqnames=rep("chr1", 3),
ranges=IRanges(
start=c(1085,1401,2001),
end=c(1090,1600,2033)),
strand="+"
)
## Compatible and spans exons 2 to 4:
> gappedReadAndExonsCompatibility(gread, exons)
[1] 2 4
## Compatible and spans exons 1 to 3:
> gappedReadAndExonsCompatibility(gread, exons[-1])
[1] 1 3
## Not compatible:
> gappedReadAndExonsCompatibility(gread, exons[-3])
[1] NA NA
So after you've used findOverlaps() to find hits between your gapped
reads and your transcripts, you could do a second pass and filter those
hits using gappedReadAndExonsCompatibility(). The function is not
vectorized to work directly on GRangesList objects so you'll have
to use a loop. But since the space that you explore now is much
reduced (thanks to initial findOverlaps filtering), performance
should still be ok.
Let is know if that works for you (and I could add this utility function
to the GenomicFeatures package).
Cheers,
H.
On 09/13/2010 11:18 AM, Elizabeth Purdom wrote:
Hello,
I am using a TranscriptDb and trying to find overlaps with transcripts.
For example, I have a gapped alignment and I want to see what
transcripts it is compatible with. If txdb is my TranscriptDb, and gr is
my gapped alignment as a GenomicRanges object, I can do findOverlaps to
see if my read overlaps in any way overlaps with the individual exons of
the transcript, but not whether it overlaps with the implied transcript.
For example, if my gapped read overlaps exon 1,2,3 of the transcript, it
can only be compatible if it overlaps in a particular way (it must
contain the end of exon 1, the beginning of exon 3, and all of exon 2).
Is there a way to check this? This is probably answered somewhere, but I
can't seem to find it.
Thanks,
Elizabeth
An example:
> txdb <- loadFeatures(system.file("extdata",
"UCSC_knownGene_sample.sqlite", package="GenomicFeatures"))
> exByTx<-exonsBy(newtxdb$txdb,"tx")
#this is compatible
> grOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"), ranges
=IRanges(c(2000,2476,3084),c(2090,2584,3089)), strand =rep("*",3))
#this is not
> grNotOk<-GRanges(seqnames =c("chr1", "chr1", "chr1"),ranges =
IRanges(c(2000,2500,3084),c(2090,2584,3089)),
strand =rep("*",3))
#both overlap the same set of transcripts, but the the second is not
compatible with either transcript
> findOverlaps(GRangesList(grOk,grNotOk),exByTx)
An object of class "RangesMatching"
Slot "matchMatrix":
query subject
[1,] 1 1
[2,] 1 2
[3,] 2 1
[4,] 2 2
Slot "DIM":
[1] 2 135
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: [email protected]
Phone: (206) 667-5791
Fax: (206) 667-1319
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing