As requested, I've put together an example of using the different intersection algorithms in bx, with some rough timings. The algorithms are:
-- Intersecter: this is the oldest and uses sorted endpoints. The search phase appears to be very slow, could be sped up a lot by improving that, but probably worth it (having run this, I'm probably going to just this and make it a wrapper to quicksect) -- quicksect: interval tree with random balance. This is what the join tool in Galaxy uses -- quicksect.pyx: a 'port' of quicksect to Cython from Brent Pederson. This is really nice, but it doesn't have exactly the same interface as the other quicksect so it hadn't been integrated yet. One important thing about this is that it can also do "neighborhood" queries -- find the 10 closest features upstream of ... -- interval_index_file.Index: tree of bins + sorted lists in bins. The advantage of this is that it has a nice on-disk representation. One can build an index of a huge number of intervals, save it, and then queries require only loading a subset of the bins. Probably the closest thing to NCList. Pure Python, so room for optimization. It can be used just in memory or on disk, I've timed both. Note: normally this should be used through the wrapper "Indexes" which writes nice headers with version, value size, et cetera. Code: http://bx.mathcs.emory.edu/hg/overlap-query-testing/ The output: Big to small python overlap_examples.py data/ hg18_chr1_phastConsElements44wayPlacental.bed.bz2 data/ hg18_chr1_knownGene.bed.bz2 Query intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2 Target intervals: data/hg18_chr1_knownGene.bed.bz2 ---> Using bx.intervals.intersection.Intersecter Building: 0.026 seconds Using: 14.943 seconds ---> Using bx.intervals.operations.quicksect.IntervalNode Building: 0.415 seconds Using: 9.561 seconds ---> Using quicksect.IntervalNode (Cython version from Brent Pederson) Building: 0.019 seconds Using: 0.268 seconds ---> Using bx.interval_index_file.Index (in memory) Building: 0.026 seconds Using: 5.947 seconds ---> Using bx.interval_index_file.Index (on disk) Building: 0.026 seconds Writing: 0.069 seconds Using: 3.290 seconds Small to big python overlap_examples.py data/hg18_chr1_knownGene.bed.bz2 data/ hg18_chr1_phastConsElements44wayPlacental.bed.bz2 Query intervals: data/hg18_chr1_knownGene.bed.bz2 Target intervals: data/hg18_chr1_phastConsElements44wayPlacental.bed.bz2 ---> Using bx.intervals.intersection.Intersecter Building: 2.108 seconds Using: 14.349 seconds ---> Using bx.intervals.operations.quicksect.IntervalNode Building: 31.120 seconds Using: 1.331 seconds ---> Using quicksect.IntervalNode (Cython version from Brent Pederson) Building: 2.208 seconds Using: 0.049 seconds ---> Using bx.interval_index_file.Index (in memory) Building: 2.371 seconds Using: 0.987 seconds ---> Using bx.interval_index_file.Index (on disk) Building: 2.694 seconds Writing: 2.487 seconds Using: 0.111 seconds On Feb 11, 2009, at 3:15 PM, Istvan Albert wrote: > On Feb 11, 2:44 pm, James Taylor <[email protected]> wrote: > >> It'd be really cool to compare some of the other features like >> IntervalIndexFile vs. NLMSA. I've not had time to do it, but it might >> be interesting (and encourage one or all of us to do some >> optimization ;). > > lately I've become interested in exploring interval overlap query > algorithms. It looks like just about everything in the genome is > transcribed at some point, thus we will end up with orders of > magnitude more interval data than what we were used to have. Having a > seamless way to query these is just as important as being able to > fetch a sequence and operate on it. > > If you guys write some docs on how this IntervalIndexFile is used in > practice, with some some examples etc. I'll write benchmarks to > compare it to other type of queries. > > Istvan > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "pygr-dev" group. To post to this group, send email to [email protected] To unsubscribe from this group, send email to [email protected] For more options, visit this group at http://groups.google.com/group/pygr-dev?hl=en -~----------~----~----~----~------~----~------~--~---
