As requested, I've put together an example of using the different  
intersection algorithms in bx, with some rough timings. The algorithms  
are:

-- Intersecter: this is the oldest and uses sorted endpoints. The  
search phase appears to be very slow, could be sped up a lot by  
improving that, but probably worth it (having run this, I'm probably  
going to just this and make it a wrapper to quicksect)

-- quicksect: interval tree with random balance. This is what the join  
tool in Galaxy uses

-- quicksect.pyx: a 'port' of quicksect to Cython from Brent Pederson.  
This is really nice, but it doesn't have exactly the same interface as  
the other quicksect so it hadn't been integrated yet. One important  
thing about this is that it can also do "neighborhood" queries -- find  
the 10 closest features upstream of ...

-- interval_index_file.Index: tree of bins + sorted lists in bins. The  
advantage of this is that it has a nice on-disk representation. One  
can build an index of a huge number of intervals, save it, and then  
queries require only loading a subset of the bins. Probably the  
closest thing to NCList. Pure Python, so room for optimization. It can  
be used just in memory or on disk, I've timed both. Note: normally  
this should be used through the wrapper "Indexes" which writes nice  
headers with version, value size, et cetera. 

Code: http://bx.mathcs.emory.edu/hg/overlap-query-testing/

The output:

Big to small

python overlap_examples.py data/ 
hg18_chr1_phastConsElements44wayPlacental.bed.bz2 data/ 
hg18_chr1_knownGene.bed.bz2
Query intervals:  data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Target intervals: data/hg18_chr1_knownGene.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 0.026 seconds
Using: 14.943 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 0.415 seconds
Using: 9.561 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 0.019 seconds
Using: 0.268 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 0.026 seconds
Using: 5.947 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 0.026 seconds
Writing: 0.069 seconds
Using: 3.290 seconds

Small to big

python overlap_examples.py data/hg18_chr1_knownGene.bed.bz2 data/ 
hg18_chr1_phastConsElements44wayPlacental.bed.bz2
Query intervals:  data/hg18_chr1_knownGene.bed.bz2
Target intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2
---> Using bx.intervals.intersection.Intersecter
Building: 2.108 seconds
Using: 14.349 seconds
---> Using bx.intervals.operations.quicksect.IntervalNode
Building: 31.120 seconds
Using: 1.331 seconds
---> Using quicksect.IntervalNode (Cython version from Brent Pederson)
Building: 2.208 seconds
Using: 0.049 seconds
---> Using bx.interval_index_file.Index (in memory)
Building: 2.371 seconds
Using: 0.987 seconds
---> Using bx.interval_index_file.Index (on disk)
Building: 2.694 seconds
Writing: 2.487 seconds
Using: 0.111 seconds


On Feb 11, 2009, at 3:15 PM, Istvan Albert wrote:

> On Feb 11, 2:44 pm, James Taylor <[email protected]> wrote:
>
>> It'd be really cool to compare some of the other features like
>> IntervalIndexFile vs. NLMSA. I've not had time to do it, but it might
>> be interesting (and encourage one or all of us to do some
>> optimization ;).
>
> lately I've become interested in exploring interval overlap query
> algorithms. It looks like just about everything in the genome is
> transcribed at some point, thus we will end up with orders of
> magnitude more interval data than what we were used to have. Having a
> seamless way to query these is just as important as being able to
> fetch a sequence and operate on it.
>
> If you guys write some docs on how this IntervalIndexFile is used in
> practice, with some some examples etc. I'll write benchmarks to
> compare it to other type of queries.
>
> Istvan
> 

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to 
[email protected]
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to