Simon,I have enhanced the aggregate and seqextract functions to accept RangesList object for subscript selection within AtomicList objects, which include RleList, IntegerList, NumericList, etc. On performance grounds, I still recommend using the viewSums approach for calculating AUC because it performs its looping at the C level, rather than the R level as aggregate does.
Keep the requests coming. Much of the IRanges package has been developed for general usage (I believe it could be a major player in time series analysis as well if developers that that community were interested in leveraging its infrastructure) and, as this week has shown, there is still work to do to support biological sequence analysis. Right now we are working on the concepts of Views and RangedData. In the future, I have a feeling we will need to beef up the representation of alignments, multiple alignments, motifs (in an abstract model sense), and other areas so we can have interchangeable special-purpose Bioconductor packages that work coherently to form a whole.
> suppressMessages(library(IRanges)) > track <- RangedData( + RangesList( + chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)), + chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4))), + score = c( 2, 7, 3, 1, 1, 1 ) ) > trackCoverage <- coverage(track, weight="score") > exons <- RangesList( + chrA = IRanges(start = c(2, 4), width = c(2, 2)), + chrB = IRanges(start = 3, width = 5)) > aggregate(trackCoverage, exons, sum) SimpleList: 2 elements names(2): chrA chrB > aggregate(trackCoverage, exons, sum)[["chrA"]] [1] 4 14 > exonView <- RleViewsList(rleList=trackCoverage, rangesList=exons) > viewSums(exonView) SimpleIntegerList: 2 elements names(2): chrA chrB > identical(aggregate(trackCoverage, exons, sum)[["chrA"]], + viewSums(exonView)[["chrA"]]) [1] TRUE > sessionInfo() R version 2.10.0 Under development (unstable) (2009-06-28 r48863) i386-apple-darwin9.7.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages:[1] stats graphics grDevices utils datasets methods base
other attached packages: [1] IRanges_1.3.37 Patrick Aboyoun wrote:
Simon,I have checked in changes to IRanges (now version 1.3.36), BioC 2.5 (devel), that improve the RangedData constructor to accept a RangesList object for its ranges object as well as added passed the element names when performing view* operations. I'm going to consider the aggregate use case some more and get back to you shortly on that one. Here is a revised example:> suppressMessages(library(IRanges)) > track <- RangedData( + RangesList( + chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)), + chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ), + score = c( 2, 7, 3, 1, 1, 1 ) ) > track RangedData: 6 ranges by 1 column on 2 sequences colnames(1): score names(2): chrA chrB > trackCoverage <- coverage(track, weight="score" ) > trackCoverage SimpleRleList: 2 elements names(2): chrA chrB > exons <- RangesList( + chrA = IRanges( start = c(2, 4), width = c(2, 2) ), + chrB = IRanges( start = 3, width = 5 ) ) > exons SimpleRangesList: 2 elements names(2): chrA chrB > exonView <- RleViewsList( rleList=trackCoverage, rangesList=exons ) > exonView SimpleRleViewsList: 2 elements names(2): chrA chrB > viewSums(exonView) SimpleIntegerList: 2 elements names(2): chrA chrB > viewSums(exonView)[["chrA"]] [1] 4 14 > sessionInfo() R version 2.10.0 Under development (unstable) (2009-06-28 r48863) i386-apple-darwin9.7.0 locale: [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8 attached base packages:[1] stats graphics grDevices utils datasets methods base other attached packages:[1] IRanges_1.3.36 Patrick Aboyoun wrote:Simon, Answering your issues in turn:Issue 1) Given this, it seems to me now that viewApply and coverage have very similar use cases. Do we need both?Reponse 1) Yes. The viewApply function is a general purpose tool for implementing a function over a set of Views. The coverage function performs a (weighted) depth count of interval/ranges over a chosen domain. The viewSums, viewMaxs, etc. are specializations of the viewApply function that are built for performance reasons, much like the colSums, etc. functions are for matrices.Issue 2) I am still quite worried that nobody outside the Hutch will be able to figure out which class to use when. There are simply so many classes in IRanges. Do you have a class hierarchy diagram?Response 2) As a first step Michael and I have fleshed out the IRanges vignettehttp://bioconductor.org/packages/2.5/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdfto explain more about the IRanges make up. The next step in the documentation is to map the entities in biological sequence analysis to IRanges classes. The chipseq package used an amorphous GenomeData class for this purpose, but that is just a glorified list that in one context can store coverage information and in another context can store exon boundaries. I have been trying to revise the chipseq package to be more specific on what data type should represent what sequence analysis entity. So for example, an RleList object is a natural representation of mapped read coverage information across chromosomes/contigs. If you want to associate exon boundaries to coverage data, then RleViewsList objects serve as a natural representation. Once we list out all of the sequence analysis entities, we can create a two column table mapping the entity to the Bioconductor class. We can then create something akin to a UML sequence diagram to show how you get from one object type to another in a typical sequence analysis workflow.Issue 3) The RangedData constructor isn't intuitive. The following commandv3b <- RangedData( RangesList( chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)), chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ), score = c( 2, 7, 3, 1, 1, 1 ) ) doesn't work as expected.Response 3) The constructor wasn't general enough. I will fix this shortly.Issue 4) The aggregate method used for *List objects aren't vectorized over list elements and the error message isn't useful.Reponse 4) I need to think about this one for a bit because one could feasibly wish to aggregate across list elements and not within them. Perhaps I could check the nature of the by argument and if it is a "list", then aggregate within elements. Otherwise, aggregate between elements as it currently does.Issue 5) The view* functions don't keep element names from original object.Reponse 5) This was an oversight on my part and I will correct this shortly.I'll send out an e-mail to the group once I check in these changes. Patrick Simon Anders wrote:Hi Patrick Patrick Aboyoun wrote:I've made further changes to the IRanges package to make it easier to perform the coverage operations across chromosomes/contigs. I've added RleViewsList (virtual) and SimpleRleViewsList (non-virtual) classes to IRanges as well as a RleViewsList constructor and viewSums and otherview* summary methods for RleViewList objects. The code below looks much cleaner now due to the RleViewsList constructor and the viewSums method.Also, I point out the viewApply function, which serves in the capacity of the aggregate function that you were experimenting with in your e-mails. Do these steps fit into your workflow?Given this, it seems to me now that viewApply and coverage have very similar use cases. Do we need both? I am still quite worried that nobody outside the Hutch will be able to figure out which class to use when. There are simply so many classes in IRanges. Do you have a class hierarchy diagram?A justification for all the RangedData-style classes is, of course, thatone usually deals with multiple chromosomes. So, I went through the example of my last mail again, now with two chromosomes. Last time, we had this here, a RangedData with a single, unnamed sequence, and one column, 'score':v3 <- RangedData(+ IRanges( start = c( 1, 4, 6 ), width = c( 3, 2, 4 ) ), + score = c( 2, 7, 3 ) )v3RangedData: 3 ranges by 1 column on 1 sequence colnames(1): scoreNow, I attempt to make a RangedData with two sequences, 'chrA' and 'chrB':v3b <- RangedData(+ RangesList( + chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)), + chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4)) ), + score = c( 2, 7, 3, 1, 1, 1 ) )v3bRangedData: 6 ranges by 0 columns on 2 sequences colnames(0): names(2): chrA chrBAs you can see, the 'score' argument was ignored. I've tried a couple ofother syntaxes, but could not get it in. What's the trick? To proceed, I add my scores manually:v3b$score <- c( 2, 7, 3, 1, 1, 1 )I still feel a bit uneasy about the fact that the split between the sequences is not apparent here, but maybe it is simplest like this. I now switch over to coverage vectors:v2b <- coverage( v3b, weight="score" )This has now nicely assembled everything over both chromosomes:as.vector( v2b$chrA )[1] 2 2 2 7 7 3 3 3 3as.vector( v2b$chrB )[1] 1 1 2 1 1 1 1 1 1 Let's get some exons, two on chrA and one on chrB:exons <- RangesList(+ chrA = IRanges( start = c(2, 4), width = c(2, 2) ), + chrB = IRanges( start = 3, width = 5 ) ) Now, 'aggregate' fails, with a rather misleading error message:Error in solveWindowSEW(length(x), start = ifelse(is.null(start), NA, :aggregate( v2b, exons, sum )Invalid sequence coordinates. Please make sure the supplied 'start', 'end' and 'width' arguments are defining a region that is within the limits of the sequence. So, I'll try with views:v2exons <- RleViewsList( rleList=v2b, rangesList=exons )viewSums( v2exons )SimpleIntegerList: 2 elementsviewSums( v2exons )[[1]][1] 4 14viewSums( v2exons )[[2]][1] 6 This worked, but we lost the sequence names:viewSums( v2exons )[["chrA"]]NULLviewSums( v2exons )$chrANULL I suppose that we again rely here on the sequences being in the same order in v2b and exons, i.e., if one has 'chrA' and 'chrB', and the other, say, 'chrB' and 'chrC', this would not get spotted, or would it?Sorry for the length nit-picking mails but this gets all a bit complex...Cheers Simon_______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing_______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________ Bioc-sig-sequencing mailing list [email protected] https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
