Hi Patrick
Patrick Aboyoun wrote:
I've made further changes to the IRanges package to make it easier to
perform the coverage operations across chromosomes/contigs. I've added
RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to
IRanges as well as a RleViewsList constructor and viewSums and other
view* summary methods for RleViewList objects. The code below looks much
cleaner now due to the RleViewsList constructor and the viewSums method.
Also, I point out the viewApply function, which serves in the capacity
of the aggregate function that you were experimenting with in your
e-mails. Do these steps fit into your workflow?
Given this, it seems to me now that viewApply and coverage have very
similar use cases. Do we need both?
I am still quite worried that nobody outside the Hutch will be able to
figure out which class to use when. There are simply so many classes in
IRanges. Do you have a class hierarchy diagram?
A justification for all the RangedData-style classes is, of course, that
one usually deals with multiple chromosomes. So, I went through the
example of my last mail again, now with two chromosomes.
Last time, we had this here, a RangedData with a single, unnamed
sequence, and one column, 'score':
v3 <- RangedData(
+ IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ),
+ score = c( 2, 7, 3 ) )
v3
RangedData: 3 ranges by 1 column on 1 sequence
colnames(1): score
Now, I attempt to make a RangedData with two sequences, 'chrA' and 'chrB':
v3b <- RangedData(
+ RangesList(
+ chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),
+ chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ),
+ score = c( 2, 7, 3, 1, 1, 1 ) )
v3b
RangedData: 6 ranges by 0 columns on 2 sequences
colnames(0):
names(2): chrA chrB
As you can see, the 'score' argument was ignored. I've tried a couple of
other syntaxes, but could not get it in. What's the trick?
To proceed, I add my scores manually:
v3b$score <- c( 2, 7, 3, 1, 1, 1 )
I still feel a bit uneasy about the fact that the split between the
sequences is not apparent here, but maybe it is simplest like this.
I now switch over to coverage vectors:
v2b <- coverage( v3b, weight="score" )
This has now nicely assembled everything over both chromosomes:
as.vector( v2b$chrA )
[1] 2 2 2 7 7 3 3 3 3
as.vector( v2b$chrB )
[1] 1 1 2 1 1 1 1 1 1
Let's get some exons, two on chrA and one on chrB:
exons <- RangesList(
+ chrA = IRanges( start = c(2, 4), width = c(2, 2) ),
+ chrB = IRanges( start = 3, width = 5 ) )
Now, 'aggregate' fails, with a rather misleading error message:
aggregate( v2b, exons, sum )
Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA, :
Invalid sequence coordinates.
Please make sure the supplied 'start', 'end' and 'width' arguments
are defining a region that is within the limits of the sequence.
So, I'll try with views:
v2exons <- RleViewsList( rleList=v2b, rangesList=exons )
viewSums( v2exons )
SimpleIntegerList: 2 elements
viewSums( v2exons )[[1]]
[1] 4 14
viewSums( v2exons )[[2]]
[1] 6
This worked, but we lost the sequence names:
viewSums( v2exons )[["chrA"]]
NULL
viewSums( v2exons )$chrA
NULL
I suppose that we again rely here on the sequences being in the same
order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the
other, say, 'chrB' and 'chrC', this would not get spotted, or would it?
Sorry for the length nit-picking mails but this gets all a bit complex...
Cheers
Simon