On 04/03/2013 10:28 AM, Kasper Daniel Hansen wrote:
Making IntervalTree chromosome would also be a great addition for organisms
with many sequences, like bee (due to an incomplete genome; 10,000s of
sequences).  It does not matter for humans, but findOverlaps is
excruciatingly slow for bee's.  I have a couple of posts on this in the
archive.

I am predicting this will be the case in the future for most non-model
organisms; finishing a genome is expensive and time consuming.

A cheap (both computationally and philosophically) hack is along the lines of

gr <- GRanges(c("A", "B", "A"), IRanges(c(2000, 1000, 3000), width=100))

rng <- range(gr)
off0 <- (width(rng)[-length(rng)] + 1L)
offset <- setNames(c(1L, cumsum(off0)) - start(rng), seqnames(rng))
shift(ranges(gr), offset[as.character(seqnames(gr))])

which shifts the ranges to be non-overlapping by seqname, so findOverlaps can operate on the IRanges directly. This is a trick I used in selectMethod(disjoin, "CompressedIRangesList"). There's the risk of integer overflow (when sum(width(rng) + 1L) > .Machine$integer.max) but this could be handled (if it occurs!) by partitioning into .Machine$integer.max-sized ranges. Unless IRanges were updated to support longer integers...

Martin


Kasper


On Wed, Apr 3, 2013 at 12:45 PM, Michael Lawrence <lawrence.mich...@gene.com
wrote:

Some ideas:

- Turn the IntervalTree into a list/array of nodes that can be
subset/reordered with shallow copying (just copy the pointers to the
nodes), and the index would be secondary. The index in the array could be
stored in each node, for lookup during overlap queries. Right now, as far
as I can tell, GIntervalTree will get confused if the user reorders e.g.
via [.

- Make IntervalTree aware of the sequence/chromosome, e.g., have a hash of
trees, which is trivial since seqnames is already a factor.

Michael



On Wed, Apr 3, 2013 at 9:29 AM, Hector Corrada Bravo <hcorr...@gmail.com
wrote:

Yep, I didn't comment on that, but I agree that abstracting how
GRanges stores ranges would make this more elegant. Right now
ranges(GRanges) is specified to be of IRanges class instead of the
abstract Ranges class.

If it were the latter then GIntervalTree can be a subclass of
GenomicRanges, in a similar way that IntervalTree is a subclass of
Ranges.

On Wed, Apr 3, 2013 at 12:23 PM, Michael Lawrence
<lawrence.mich...@gene.com> wrote:
Hi Hector,

That's interesting, thanks for passing this along. I'm still wishing
that
somehow GRanges itself could abstract the way it stores ranges. I know
that
Herve/Patrick had some reasons for depending specifically on GRanges.
One
reason was probably convenience at the C level, but it wouldn't be hard
to
create a Ranges abstraction at the C level, as well.

Michael



On Tue, Apr 2, 2013 at 5:40 PM, Hector Corrada Bravo <
hcorr...@gmail.com

wrote:

Hello bioc-develers,

I'm writing an application where lots findOverlap calls are made on
static GRanges objects. For IRanges we can create persistent
IntervalTree objects that would serve the multiple overlap query
use-case. There is no equivalent for GenomicRanges objects, so I'm
proposing an implementation for this.

Please check
http://github.com/hcorrada/GenomicIntervalTree

There's a first cut implementation there you can test by installing
this skeleton package. E.g,

library(devtools)
install_github("GenomicIntervalTree", username="hcorrada",
subdir="pkg")
library(GenomicIntervalTree)

Let me know what you think.

Cheers,
Hector

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




         [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to