Re: [Bioc-sig-seq] seqselect on SimpleRleList and RangesList - bug? and request

Michael Lawrence Mon, 14 Jun 2010 15:12:18 -0700

Well, as I've said, I think this depends on whether the operation is a
"search" operation or not. For example, adding two RleLists would not be a
search operation, but findOverlaps() is. I think you make a good case for
treating seqselect() as a search operation. So +1 for Janet's specific
request.
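
To make the distinction concrete, here is a minimal base-R sketch of what "search" semantics could look like. It uses plain lists rather than IRanges classes, and select_by_space() and pick() are hypothetical names for illustration, not IRanges functions. The assumption is the rule discussed in the quoted messages: match by name when both lists carry distinct, non-empty names, otherwise fall back to list position.

```r
# Hypothetical helper (not part of IRanges): pair two lists by name when both
# have distinct, non-empty names ("search" semantics), else pair by position.
select_by_space <- function(x, ranges, FUN) {
  nx <- names(x)
  nr <- names(ranges)
  usable <- function(n) {
    !is.null(n) && !anyNA(n) && all(nzchar(n)) && !anyDuplicated(n)
  }
  if (usable(nx) && usable(nr)) {
    missing_spaces <- setdiff(nr, nx)
    if (length(missing_spaces) > 0L)
      stop("no scores for space(s): ", paste(missing_spaces, collapse = ", "))
    # Only the spaces named in 'ranges' are returned, in the order of 'ranges'
    Map(FUN, x[nr], ranges)
  } else {
    if (length(x) != length(ranges))
      stop("lists without usable names must have the same length")
    Map(FUN, x, ranges)
  }
}

# Per-position scores for two "chromosomes" (plain numeric vectors)
scores <- list(chrA = c(2, 2, 2, 7, 7, 3, 3, 3, 3),
               chrB = c(1, 1, 2, 1, 1, 1, 1, 1, 1))

# Positions to keep: once in a different order, once for a single space
regions_reordered <- list(chrB = 3:7, chrA = c(2:3, 4:5))
regions_chrA_only <- list(chrA = c(2:3, 4:5))

pick <- function(s, idx) s[idx]
res_reordered <- select_by_space(scores, regions_reordered, pick)
res_chrA_only <- select_by_space(scores, regions_chrA_only, pick)
```

With name matching, the reordered and the chrA-only inputs both select from the intended chromosome. An element-wise operation such as adding two lists would keep positional pairing.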


Michael

On Mon, Jun 14, 2010 at 11:37 AM, Janet Young <[email protected]> wrote:

> Thank you both - sounds like this touches on a lot of different functions.
>  I've used findOverlaps quite a bit, so I had assumed other functions would
> work similarly, using the space names as indices. It would be good to make
> it really obvious to the user that this isn't the case.
>
> From my naive user's point of view, it would be really useful to be able to
> use the chromosome names to select portions of a bunch of Rles (without
> requiring parallel space naming), in an analogous way to using findOverlaps
> on RangedData objects. Perhaps a switch on seqselect (usenames=TRUE)?  Or
> introduce another function?
>
> Or, if you go with Patrick's idea below about requiring space names either
> to be all NULL or all distinct and non-empty, then (by analogy with
> findOverlaps), you can use space names if they're present, and if they're
> not present go by list position. As for your question about what to do when
> name sets are not identical, I haven't looked at how you solved that for
> findOverlaps but perhaps something analogous could work for seqselect.
>
> I guess I can do what I need to do by redefining my range information to
> use ordered factors as space names but it'd be really nice not to have to
> take those extra steps - it seems a lot more complicated than it needs to
> be. I think that'll work, because the way I was using seqselect on my real
> data was to define my ranges as RangedData objects, and then pass that to
> seqselect using ranges(my_rangeddata_object).
>
> I can imagine a lot of cases where the user might want to select scores
> from the whole genome on just a subset of ranges that don't include one on
> every chromosome (and where chromosomes might not be sorted in the same way
> as the genome).  Am I thinking about this the wrong way?  Maybe there are
> better ways to represent the scores than SimpleRleList that would allow this
> more easily.
>
> I don't know enough about the inner workings of IRanges to push strongly
> for any particular solution, but from the biologist's point of view here are
> a couple of questions to stimulate discussion.   Why allow names to be
> specified at all if they're not meaningful?  What kind of situation would a
> user be trying to represent with the "RangesList(a = IRanges(1,1), a =
> IRanges(1,2))" example? Could this situation simply be disallowed - would
> that mess up any real examples?
>
> thanks again,
>
> Janet
>
>
>
>
>
> On Jun 12, 2010, at 6:04 AM, Patrick Aboyoun wrote:
>
>> But what is the scope of space? For example, the reduce operation has no
>> concept of space (see below). In GenomicRanges, we introduced the concept of
>> seqlengths to a number of classes including GRanges and GRangesList. There
>> are certain restrictions of what can be held in a seqlengths slot, for
>> example you can't mix NAs with non-NAs. Perhaps we can formalize space for
>> all List objects so that you either have names of NULL or all the names must
>> be distinct, non-empty strings. We would also have to define what happens in
>> a binary operation involving two List objects when name sets are not
>> identical.
>>
>>
>> > RangesList(a = IRanges(1,1), a = IRanges(1,2))
>> SimpleRangesList of length 2
>> $a
>> IRanges of length 1
>>    start end width
>> [1]     1   1     1
>>
>> $a
>> IRanges of length 1
>>    start end width
>> [1]     1   2     2
>>
>> > validObject(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
>> [1] TRUE
>> > reduce(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
>> SimpleRangesList of length 2
>> $a
>> IRanges of length 1
>>    start end width
>> [1]     1   1     1
>>
>> $a
>> IRanges of length 1
>>    start end width
>> [1]     1   2     2
>>
>>
>>
>> Patrick
>>
>>
>>
>> On 6/12/10 5:47 AM, Michael Lawrence wrote:
>>
>>>
>>>
>>>
>>> On Sat, Jun 12, 2010 at 12:17 AM, Patrick Aboyoun <[email protected]>
>>> wrote:
>>> Janet,
>>> Most functions in the IRanges package follow the R convention of
>>> considering the elements of names to be loosely linked attributes rather
>>> than rigid keys. For convenience, functions such as $, [, [[ treat a list as
>>> a hash if it has names, but in most circumstances the names are ignored or
>>> copied without use. Even when there are names on elements, there are some
>>> odd corner cases that can cause problems. For example, if I wanted to have
>>> multiple list elements with the same name, then some important operations
>>> give unexpected results:
>>>
>>> > list(a = 1, a = 2)["a"]
>>> $a
>>> [1] 1
>>>
>>> If the issue is limited to enhancing the seqselect function to make it name
>>> aware, it probably makes sense to go ahead with the enhancement. But the
>>> scope of this issue can grow quite large. For example, should names be used
>>> when adding two RleList objects? What should the following produce?
>>>
>>> RleList(a = Rle(1)) + RleList(a = Rle(2), a = Rle(3), b = Rle(4))
>>>
>>> Due to these types of ambiguities, I would rather focus on educating the
>>> user to be mindful that these are position-oriented rather than key-oriented
>>> objects and have them ensure that elements are in alignment.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>> Sometimes in IRanges the names have a special semantic -- that of a
>>> "space". I guess this is limited to RangesList. Other data structures, like
>>> RleList, are often treated as being separated by space or chromosome, though
>>> their names have never explicitly been treated as the space. This
>>> inconsistency is probably OK, but it needs to be documented.
>>>
>>> Patrick
>>>
>>>
>>>
>>>
>>> On 6/11/10 4:06 PM, Janet Young wrote:
>>> Hi,
>>>
>>> I've been playing around with seqselect on scores stored in a
>>> SimpleRleList object to get subregions defined in a RangesList object.
>>>
>>> I found a couple of things:  first an enhancement request - would it be
>>> possible to allow seqselect to deal with cases where not every space (name)
>>> in the SimpleRleList has a corresponding space/name in the RangesList
>>> object?
>>>
>>> The second is either a bug or else I've misunderstood the way seqselect is
>>> supposed to work, in a dangerous way - it looks like seqselect doesn't use
>>> the names of the list items to select scores, it just assumes that in the
>>> two lists the elements have the same names in the same order.
>>>
>>> The code below should explain both issues much better than those
>>> descriptions.
>>>
>>> thanks,
>>>
>>> Janet
>>>
>>>
>>>
>>> > library(IRanges)
>>>
>>> Attaching package: 'IRanges'
>>>
>>> The following object(s) are masked from 'package:base':
>>>
>>>   cbind, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int,
>>> rbind, rep.int, table
>>>
>>> >
>>> > ### generate some arbitrary scores
>>> > track <- RangedData(RangesList(chrA = IRanges(start = c(1, 4, 6), width = c(3, 2, 4)),
>>> +                                chrB = IRanges(start = c(1, 3, 6), width = c(3, 3, 4))))
>>> > trackCoverage <- coverage(track, weight = list(chrA = c(2, 7, 3), chrB = c(1, 1, 1)))
>>> >
>>> > ### define subregions
>>> > exons <- RangesList(chrA = IRanges(start = c(2, 4), width = c(2, 2)),
>>> +                     chrB = IRanges(start = 3, width = 5))
>>> >
>>> > ### seqselect works if all spaces in trackCoverage have an element in exons
>>> > seqselect(trackCoverage, exons)
>>> SimpleRleList of length 2
>>> $chrA
>>> 'integer' Rle of length 4 with 2 runs
>>>  Lengths: 2 2
>>>  Values : 2 7
>>>
>>> $chrB
>>> 'integer' Rle of length 5 with 2 runs
>>>  Lengths: 1 4
>>>  Values : 2 1
>>>
>>> >
>>> > ### define subregions only on one chr
>>> > exons_chrAonly <- RangesList(chrA = IRanges(start = c(2, 4), width = c(2, 2)))
>>> > ### now seqselect doesn't work if some spaces don't have any elements
>>> > seqselect(trackCoverage, exons_chrAonly)
>>> Error in seqselect(trackCoverage, exons_chrAonly) :
>>>  'length(start)' must equal 'length(x)' when 'end' and 'width' are NULL
>>> >
>>> >
>>> > ##### also, defining the regions with spaces in a different order seems to
>>> > ##### cause trouble, as seqselect doesn't seem to be using the list's names,
>>> > ##### just going by order of elements
>>> > exons_reorderchrs <- RangesList(chrB = IRanges(start = 3, width = 5),
>>> +                                 chrA = IRanges(start = c(2, 4), width = c(2, 2)))
>>> > seqselect(trackCoverage, exons_reorderchrs)
>>> SimpleRleList of length 2
>>> $chrA
>>> 'integer' Rle of length 5 with 3 runs
>>>  Lengths: 1 2 2
>>>  Values : 2 7 3
>>>
>>> $chrB
>>> 'integer' Rle of length 4 with 3 runs
>>>  Lengths: 1 1 2
>>>  Values : 1 2 1
>>>
>>> >
>>> > identical(seqselect(trackCoverage, exons), seqselect(trackCoverage, exons_reorderchrs))
>>> [1] FALSE
>>> >
>>> > sessionInfo()
>>> R version 2.11.1 (2010-05-31)
>>> i386-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] IRanges_1.6.6
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> [email protected]
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>>
>>
>
