Thank you both - sounds like this touches on a lot of different
functions. I've used findOverlaps quite a bit, so I had assumed other
functions would work similarly, using the space names as indices. It
would be good to make it really obvious to the user that isn't the case.
From my naive user's point of view, it would be really useful to be
able to use the chromosome names to select portions of a bunch of Rles
(without requiring parallel space naming), in an analogous way to
using findOverlaps on RangedData objects. Perhaps a switch on
seqselect (usenames=TRUE)? Or introduce another function?
Or, if you go with Patrick's idea below about requiring space names
either to be all NULL or all distinct and non-empty, then (by analogy
with findOverlaps), you can use space names if they're present, and if
they're not present go by list position. As for your question about
what to do when name sets are not identical, I haven't looked at how you
solved that for findOverlaps but perhaps something analogous could
work for seqselect.
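To make the "use names if present, otherwise go by position" idea concrete, here is a minimal base R sketch on plain lists rather than RleList/RangesList objects. The helpers `usable_names` and `select_by_space` are hypothetical names invented for illustration; they are not part of IRanges, and this is not how seqselect or findOverlaps is implemented.

```r
# Hypothetical helpers (not part of IRanges), using plain lists as
# stand-ins for a list of scores and a list of index ranges.

# Names are "usable" as keys only if present, non-missing,
# non-empty and all distinct.
usable_names <- function(nms) {
  !is.null(nms) && !anyNA(nms) && all(nzchar(nms)) && !anyDuplicated(nms)
}

select_by_space <- function(x, idx) {
  if (usable_names(names(x)) && usable_names(names(idx))) {
    missing <- setdiff(names(idx), names(x))
    if (length(missing) > 0L)
      stop("no element of 'x' named: ", paste(missing, collapse = ", "))
    # Pair by name; spaces of 'x' that are absent from 'idx' are skipped.
    Map(function(v, i) v[i], x[names(idx)], idx)
  } else {
    # Fall back to list position, which requires equal lengths.
    stopifnot(length(x) == length(idx))
    Map(function(v, i) v[i], x, idx)
  }
}

scores <- list(chrA = c(2, 2, 2, 7, 7), chrB = c(1, 1, 2, 2, 1))
exons  <- list(chrB = 3:5, chrA = 2:3)   # different order from 'scores'
res <- select_by_space(scores, exons)
chrA_only <- select_by_space(scores, list(chrA = 4:5))
```

With this rule, reordering the chromosomes in `exons` or leaving some out does not change which scores are picked for each chromosome. One design choice here is to return the result in the order of `idx`; returning it in the order of `x` would be equally defensible.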
I guess I can do what I need to do by redefining my range information
to use ordered factors as space names but it'd be really nice not to
have to take those extra steps - it seems a lot more complicated than
it needs to be. I think that'll work, because the way I was using
seqselect on my real data was to define my ranges as RangedData
objects, and then pass that to seqselect using
ranges(my_rangeddata_object).
I can imagine a lot of cases where the user might want to select
scores from the whole genome on just a subset of ranges that don't
include one on every chromosome (and where chromosomes might not be
sorted in the same way as the genome). Am I thinking about this the
wrong way? Maybe there are better ways to represent the scores than
SimpleRleList that would allow this more easily.
I don't know enough about the inner workings of IRanges to push
strongly for any particular solution, but from the biologist's point
of view here are a couple of questions to stimulate discussion. Why
allow names to be specified at all if they're not meaningful? What
kind of situation would a user be trying to represent with the
"RangesList(a = IRanges(1,1), a = IRanges(1,2))" example? Could this
situation simply be disallowed - would that mess up any real examples?
thanks again,
Janet
On Jun 12, 2010, at 6:04 AM, Patrick Aboyoun wrote:
But what is the scope of space? For example, the reduce operation
has no concept of space (see below). In GenomicRanges, we introduced
the concept of seqlengths to a number of classes including GRanges
and GRangesList. There are certain restrictions of what can be held
in a seqlengths slot, for example you can't mix NAs with non-NAs.
Perhaps we can formalize space for all List objects so that you
either have names of NULL or all the names must be distinct, non-
empty strings. We would also have to define what happens in a binary
operation involving two List objects when name sets are not identical.
> RangesList(a = IRanges(1,1), a = IRanges(1,2))
SimpleRangesList of length 2
$a
IRanges of length 1
start end width
[1] 1 1 1
$a
IRanges of length 1
start end width
[1] 1 2 2
> validObject(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
[1] TRUE
> reduce(RangesList(a = IRanges(1,1), a = IRanges(1,2)))
SimpleRangesList of length 2
$a
IRanges of length 1
start end width
[1] 1 1 1
$a
IRanges of length 1
start end width
[1] 1 2 2
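The proposed rule (names are either NULL or all distinct, non-empty strings) can be sketched as a small check in base R. The function name `valid_space_names` is hypothetical, and plain lists stand in for List objects; this is an illustration of the rule, not an existing IRanges validity method.

```r
# Hypothetical validity rule: names are either absent, or all
# distinct, non-missing, non-empty strings.
valid_space_names <- function(x) {
  nms <- names(x)
  is.null(nms) || (!anyNA(nms) && all(nzchar(nms)) && !anyDuplicated(nms))
}

ok_unnamed  <- valid_space_names(list(1, 2))          # no names at all
ok_distinct <- valid_space_names(list(a = 1, b = 2))  # distinct names
bad_dup     <- valid_space_names(list(a = 1, a = 2))  # duplicated name
bad_partial <- valid_space_names(list(a = 1, 2))      # one empty name
```

Under such a rule the duplicated-name example above would be rejected at construction time rather than passing validObject.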
Patrick
On 6/12/10 5:47 AM, Michael Lawrence wrote:
On Sat, Jun 12, 2010 at 12:17 AM, Patrick Aboyoun
<[email protected]> wrote:
Janet,
Most functions in the IRanges package follow the R convention of
considering the elements of names to be loosely linked attributes
rather than rigid keys. For convenience, functions such as $, [,
[[ treat a list as a hash if it has names, but in most
circumstances the names are ignored or copied without use. Even
when there are names on elements, there are some odd corner cases
that can cause problems. For example, if I wanted to have multiple
list elements with the same name, then some important operations
give unexpected results:
> list(a = 1, a = 2)["a"]
$a
[1] 1
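The same position-oriented convention shows up in base R arithmetic on named vectors, which may make the behaviour easier to picture. This is a small sketch using only base R; the variable names are made up for illustration.

```r
x <- c(a = 1, b = 2)
y <- c(b = 10, a = 20)

# Elements are paired by position; the names of the result are
# simply copied from the first operand.
by_position <- x + y

# To pair by key, the caller has to reorder one operand by name first.
by_name <- x + y[names(x)]
```

Here `by_position` adds `a` to `b` and `b` to `a` even though both vectors carry names, while `by_name` adds like-named elements only because `y` was explicitly reordered beforehand.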
If the issue is limited to enhancing the seqselect function to make
it name aware, it probably makes sense to go ahead with the
enhancement. But the scope of this issue can grow quite large. For
example, should names be used when adding to RleList objects? What
should the following produce?
RleList(a = Rle(1)) + RleList(a = Rle(2), a = Rle(3), b = Rle(4))
Due to these types of ambiguities, I would rather focus on
educating the user to be mindful that these are position-oriented
rather than key-oriented objects and have them ensure that elements
are in alignment.
Thoughts?
Sometimes in IRanges the names have a special semantic -- that of a
"space". I guess this is limited to RangesList. Other data
structures, like RleList, are often treated as being separated by
space or chromosome, though their names have never explicitly been
treated as the space. This inconsistency is probably OK, but it
needs to be documented.
Patrick
On 6/11/10 4:06 PM, Janet Young wrote:
Hi,
I've been playing around with seqselect on scores stored in a
SimpleRleList object to get subregions defined in a RangesList
object.
I found a couple of things: first an enhancement request - would
it be possible to allow seqselect to deal with cases where not
every space (name) in the SimpleRleList has a corresponding space/
name in the RangesList object?
The second is either a bug or else I've misunderstood the way
seqselect is supposed to work, in a dangerous way - it looks like
seqselect doesn't use the names of the list items to select scores,
it just assumes that in the two lists the elements have the same
names in the same order.
The code below should explain both issues much better than
those descriptions.
thanks,
Janet
> library(IRanges)
Attaching package: 'IRanges'
The following object(s) are masked from 'package:base':
cbind, Map, mapply, order, paste, pmax, pmax.int, pmin,
pmin.int, rbind, rep.int, table
>
> ### generate some arbitrary scores
> track <- RangedData(RangesList(chrA = IRanges(start = c(1, 4, 6),
width=c(3, 2, 4)),chrB = IRanges(start = c(1, 3, 6), width=c(3, 3,
4))) )
> trackCoverage <- coverage(track,
weight=list(chrA=c(2,7,3),chrB=c(1,1,1)) )
>
> ### define subregions
> exons <- RangesList(chrA = IRanges(start = c(2, 4), width =
c(2,2)),chrB = IRanges(start = 3, width = 5))
>
> ### seqselect works if all spaces in trackCoverage have an element in exons
> seqselect(trackCoverage,exons )
SimpleRleList of length 2
$chrA
'integer' Rle of length 4 with 2 runs
Lengths: 2 2
Values : 2 7
$chrB
'integer' Rle of length 5 with 2 runs
Lengths: 1 4
Values : 2 1
>
> ### define subregions only on one chr
> exons_chrAonly <- RangesList(chrA = IRanges(start = c(2, 4),
width = c(2, 2)))
> ### now seqselect doesn't work if some spaces don't have any elements
> seqselect(trackCoverage,exons_chrAonly )
Error in seqselect(trackCoverage, exons_chrAonly) :
'length(start)' must equal 'length(x)' when 'end' and 'width' are
NULL
>
>
> ##### also, defining the regions with spaces in a different order seems to
> ##### cause trouble as seqselect doesn't seem to be using the list's names -
> ##### just going by order of elements
> exons_reorderchrs <- RangesList(chrB = IRanges(start = 3, width =
5),chrA = IRanges(start = c(2, 4), width = c(2,2)))
> seqselect(trackCoverage,exons_reorderchrs )
SimpleRleList of length 2
$chrA
'integer' Rle of length 5 with 3 runs
Lengths: 1 2 2
Values : 2 7 3
$chrB
'integer' Rle of length 4 with 3 runs
Lengths: 1 1 2
Values : 1 2 1
>
> identical ( seqselect(trackCoverage,exons ) ,
seqselect(trackCoverage,exons_reorderchrs ) )
[1] FALSE
>
> sessionInfo()
R version 2.11.1 (2010-05-31)
i386-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.6.6
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing