Well, as I've said, I think this depends on whether the operation is a "search" operation or not. For example, adding two RleLists would not be a search operation, but findOverlaps() is. I think you make a good case for treating seqselect() as a search operation. So +1 for Janet's specific request.
Michael

On Mon, Jun 14, 2010 at 11:37 AM, Janet Young wrote:

> Thank you both - sounds like this touches on a lot of different functions.
> I've used findOverlaps quite a bit, so I had assumed other functions would
> work similarly, using the space names as indices. It would be good to make
> it really obvious to the user that isn't the case.
>
> From my naive user's point of view, it would be really useful to be able to
> use the chromosome names to select portions of a bunch of Rles (without
> requiring parallel space naming), in an analogous way to using findOverlaps
> on RangedData objects. Perhaps a switch on seqselect (usenames=TRUE)? Or
> introduce another function?
>
> Or, if you go with Patrick's idea below about requiring space names either
> to be all NULL or all distinct and non-empty, then (by analogy with
> findOverlaps), you can use space names if they're present, and if they're
> not present go by list position. As for your question about what to do when
> name sets are not identical, I haven't looked at how you solved that for
> findOverlaps but perhaps something analogous could work for seqselect.
>
> I guess I can do what I need to do by redefining my range information to
> use ordered factors as space names but it'd be really nice not to have to
> take those extra steps - it seems a lot more complicated than it needs to
> be. I think that'll work, because the way I was using seqselect on my real
> data was to define my ranges as RangedData objects, and then pass that to
> seqselect using ranges(my_rangeddata_object).
>
> I can imagine a lot of cases where the user might want to select scores
> from the whole genome on just a subset of ranges that don't include one on
> every chromosome (and where chromosomes might not be sorted in the same way
> as the genome). Am I thinking about this the wrong way? Maybe there are
> better ways to represent the scores than SimpleRleList that would allow this
> more easily.
>
> I don't know enough about the inner workings of IRanges to push strongly
> for any particular solution, but from the biologist's point of view here are
> a couple of questions to stimulate discussion. Why allow names to be
> specified at all if they're not meaningful? What kind of situation would a
> user be trying to represent with the "RangesList(a = IRanges(1,1), a =
> IRanges(1,2))" example? Could this situation simply be disallowed - would
> that mess up any real examples?
>
> thanks again,
>
> Janet
>
>
> On Jun 12, 2010, at 6:04 AM, Patrick Aboyoun wrote:
>
>> But what is the scope of space? For example, the reduce operation has no
>> concept of space (see below). In GenomicRanges, we introduced the concept of
>> seqlengths to a number of classes including GRanges and GRangesList. There
>> are certain restrictions of what can be held in a seqlengths slot, for
>> example you can't mix NAs with non-NAs. Perhaps we can formalize space for
>> all List objects so that you either have names of NULL or all the names must
>> be distinct, non-empty strings. We would also have to define what happens in
>> a binary operation involving two List objects when name sets are not
>> identical.
>>
>> > RangesList(a = IRanges(1,1), a = IRanges(1,2))
>> SimpleRangesList of length 2
>> $a
>> IRanges of length 1
>>     start end width
>> [1]     1   1     1
>>
>> $a
>> IRanges of length 1
>>     start end width
>> [1]     1   2     2
>>
>> > validObject(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
>> [1] TRUE
>> > reduce(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
>> SimpleRangesList of length 2
>> $a
>> IRanges of length 1
>>     start end width
>> [1]     1   1     1
>>
>> $a
>> IRanges of length 1
>>     start end width
>> [1]     1   2     2
>>
>>
>> Patrick
>>
>>
>> On 6/12/10 5:47 AM, Michael Lawrence wrote:
>>
>>> On Sat, Jun 12, 2010 at 12:17 AM, Patrick Aboyoun wrote:
>>> Janet,
>>> Most functions in the IRanges package follow the R convention of
>>> considering the elements of names to be loosely linked attributes rather
>>> than rigid keys. For convenience, functions such as $, [, [[ treat a list as
>>> a hash if it has names, but in most circumstances the names are ignored or
>>> copied without use. Even when there are names on elements, there are some
>>> odd corner cases that can cause problems. For example, if I wanted to have
>>> multiple list elements with the same name, then some important operations
>>> give unexpected results:
>>>
>>> > list(a = 1, a = 2)["a"]
>>> $a
>>> [1] 1
>>>
>>> If the issue is limited to enhancing the seqselect function to make it name
>>> aware, it probably makes sense to go ahead with the enhancement. But the
>>> scope of this issue can grow quite large. For example, should names be used
>>> when adding two RleList objects? What should the following produce?
>>>
>>> RleList(a = Rle(1)) + RleList(a = Rle(2), a = Rle(3), b = Rle(4))
>>>
>>> Due to these types of ambiguities, I would rather focus on educating the
>>> user to be mindful that these are position-oriented rather than key-oriented
>>> objects and have them ensure that elements are in alignment.
>>>
>>> Thoughts?
>>>
>>>
>>> Sometimes in IRanges the names have a special semantic -- that of a
>>> "space". I guess this is limited to RangesList. Other data structures, like
>>> RleList, are often treated as being separated by space or chromosome, though
>>> their names have never explicitly been treated as the space. This
>>> inconsistency is probably OK, but it needs to be documented.
>>>
>>> Patrick
>>>
>>>
>>> On 6/11/10 4:06 PM, Janet Young wrote:
>>> Hi,
>>>
>>> I've been playing around with seqselect on scores stored in a
>>> SimpleRleList object to get subregions defined in a RangesList object.
>>>
>>> I found a couple of things: first an enhancement request - would it be
>>> possible to allow seqselect to deal with cases where not every space (name)
>>> in the SimpleRleList has a corresponding space/name in the RangesList
>>> object?
>>>
>>> The second is either a bug or else I've misunderstood the way seqselect is
>>> supposed to work, in a dangerous way - it looks like seqselect doesn't use
>>> the names of the list items to select scores, it just assumes that in the
>>> two lists the elements have the same names in the same order.
>>>
>>> The code below should explain both issues much better than those
>>> descriptions.
>>>
>>> thanks,
>>>
>>> Janet
>>>
>>>
>>> > library(IRanges)
>>>
>>> Attaching package: 'IRanges'
>>>
>>> The following object(s) are masked from 'package:base':
>>>
>>>     cbind, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int,
>>>     rbind, rep.int, table
>>>
>>> >
>>> > ### generate some arbitrary scores
>>> > track <- RangedData(RangesList(chrA = IRanges(start = c(1, 4, 6), width=c(3, 2, 4)),chrB = IRanges(start = c(1, 3, 6), width=c(3, 3, 4))) )
>>> > trackCoverage <- coverage(track, weight=list(chrA=c(2,7,3),chrB=c(1,1,1)) )
>>> >
>>> > ### define subregions
>>> > exons <- RangesList(chrA = IRanges(start = c(2, 4), width = c(2,2)),chrB = IRanges(start = 3, width = 5))
>>> >
>>> > ### seqselect works if all spaces in trackCoverage have an element in exons
>>> > seqselect(trackCoverage,exons )
>>> SimpleRleList of length 2
>>> $chrA
>>> 'integer' Rle of length 4 with 2 runs
>>>   Lengths: 2 2
>>>   Values : 2 7
>>>
>>> $chrB
>>> 'integer' Rle of length 5 with 2 runs
>>>   Lengths: 1 4
>>>   Values : 2 1
>>>
>>> >
>>> > ### define subregions only on one chr
>>> > exons_chrAonly <- RangesList(chrA = IRanges(start = c(2, 4), width = c(2, 2)))
>>> > ### now seqselect doesn't work if some spaces don't have any elements
>>> > seqselect(trackCoverage,exons_chrAonly )
>>> Error in seqselect(trackCoverage, exons_chrAonly) :
>>>   'length(start)' must equal 'length(x)' when 'end' and 'width' are NULL
>>> >
>>> >
>>> > ##### also, defining the regions with spaces in a different order seems to cause trouble as seqselect doesn't seem to be using the list's names - just going by order of elements
>>> > exons_reorderchrs <- RangesList(chrB = IRanges(start = 3, width = 5),chrA = IRanges(start = c(2, 4), width = c(2,2)))
>>> > seqselect(trackCoverage,exons_reorderchrs )
>>> SimpleRleList of length 2
>>> $chrA
>>> 'integer' Rle of length 5 with 3 runs
>>>   Lengths: 1 2 2
>>>   Values : 2 7 3
>>>
>>> $chrB
>>> 'integer' Rle of length 4 with 3 runs
>>>   Lengths: 1 1 2
>>>   Values : 1 2 1
>>>
>>> >
>>> > identical ( seqselect(trackCoverage,exons ) , seqselect(trackCoverage,exons_reorderchrs ) )
>>> [1] FALSE
>>> >
>>> > sessionInfo()
>>> R version 2.11.1 (2010-05-31)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] IRanges_1.6.6
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
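
For illustration only, the name-based ("search"-style) selection discussed in this thread can be sketched in base R, without IRanges. Everything here is an assumption made for the sketch: select_by_name() is a hypothetical helper and not an IRanges function, plain numeric vectors stand in for the Rle coverage, and each set of ranges is a plain list of start and width vectors. The name check follows Patrick's suggestion that names be distinct, non-empty strings, and stopping on a space with no scores is just one possible answer to the open question of what to do when the name sets differ.

```r
# Hypothetical helper, not part of IRanges: pick out scores by matching
# list names ("spaces") instead of relying on list position.
select_by_name <- function(scores, ranges) {
  # Require distinct, non-empty names on both lists.
  check_names <- function(x) {
    nms <- names(x)
    !is.null(nms) && !anyNA(nms) && all(nzchar(nms)) && !anyDuplicated(nms)
  }
  stopifnot(check_names(scores), check_names(ranges))

  # Assumed policy: a space in 'ranges' with no scores is an error.
  unknown <- setdiff(names(ranges), names(scores))
  if (length(unknown) > 0L) {
    stop("no scores for space(s): ", paste(unknown, collapse = ", "))
  }

  spaces <- names(ranges)
  lapply(setNames(spaces, spaces), function(space) {
    r <- ranges[[space]]
    # Expand each (start, width) pair into positions; width 0 gives none.
    idx <- unlist(Map(function(s, w) s + seq_len(w) - 1, r$start, r$width))
    scores[[space]][idx]
  })
}

# Plain vectors standing in for the per-chromosome coverage in the example
# above (worked out by hand from the ranges and weights given there).
scores <- list(chrA = c(2, 2, 2, 7, 7, 3, 3, 3, 3),
               chrB = c(1, 1, 2, 1, 1, 1, 1, 1, 1))

exons          <- list(chrA = list(start = c(2, 4), width = c(2, 2)),
                       chrB = list(start = 3, width = 5))
exons_reorder  <- exons[c("chrB", "chrA")]  # same regions, other order
exons_chrAonly <- exons["chrA"]             # regions on one space only

by_name   <- select_by_name(scores, exons)
reordered <- select_by_name(scores, exons_reorder)
chrA_only <- select_by_name(scores, exons_chrAonly)
```

In this sketch the result follows the order and names of the ranges argument, so reordering the spaces or leaving some out changes only which elements come back, not which scores they contain. Whether a real seqselect enhancement should behave this way, or should instead return one element per space in the scores, is a design choice the thread leaves open.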
