John,

Thanks for the tips. Can this be specified in the -part.txt?

I have one follow-up question. Suppose that the features mapped to the dna are short enough that you don't care about features that start before the query window or end after the query window. The query would now be:

SELECT col1,col2,col3 FROM table WHERE dna=dna_id AND i_start < start < i_end

In this case, would it make sense for the start column to have an interval encoded index? Also, I read Rishi Sinha's paper on multi-level indexes and it seems like a 2 level index could help if the top level is equi-width bins of 1000 positions.

Andrew

On Sep 15, 2009, at 1:38 PM, K. John Wu wrote:

> Hi, Andrew,
>
> You are right that DNA should be indexed with equality encoding and
> without binning. If you have not specified anything else, FastBit
> should be able to determine this one -- this is the easy case. If you
> want to make it very explicit, the index specification can be
>
> index=<binning none/><encoding equality/>
>
> Both start and end are integers that have a very large number of
> distinct values. Furthermore, the query boundaries (i_start and
> i_end) can be arbitrary integers as well. In this case, your best bet
> might be
>
> index=<binning none/><encoding binary/>
>
> which produces a bit-sliced index (or binary encoded index). Since it
> is a precise index, it can answer queries with any arbitrary query
> boundaries without the need to scan the base data. Should be pretty
> fast.
>
> Alternatively, if your query boundaries are typically of the form
> 2.2e5 (220,000), i.e., with relatively few significant digits (2 in
> this particular example), then it is possible to bin start and end
> with a sufficient number of significant digits to avoid the need to
> access the base data. Say your query boundaries never use more than
> three significant digits for mantissa (could have arbitrary
> exponents), then you might consider
>
> index=<binning precision=3/><encoding integer-equality/>
>
> There are a few other cases spelled out in
> <http://crd.lbl.gov/~kewu/fastbit/doc/indexSpec.html> as well. Let us
> know if you have additional questions.
>
> John
>
>
> On 9/15/2009 10:12 AM, Andrew Olson wrote:
>> Hi John,
>> I am interested in tuning my FastBit indexes to optimize query
>> performance while keeping the file sizes from growing too much. The
>> most common query that my users pose extracts features that overlap
>> with a specific genomic interval. It uses these three columns:
>>
>> dna - low cardinality (<100)
>> start - high cardinality (min = ~1 max = ~250 million)
>> end - high cardinality (min = ~1 max = ~250 million)
>>
>> SELECT col1,col2,col3 FROM table WHERE dna = dna_id AND end > i_start
>> AND start < i_end
>>
>> The size of the query interval (i_end - i_start) varies from 20 to
>> 200,000.
>> From my reading of the FastBit literature, it looks like the dna
>> column should be equality encoded and the other 2 columns should be
>> binned and range encoded, but I could be mistaken.
>>
>> 1. What are the default index specs for these columns?
>> 2. Which other options should I try?
>> 3. More generally, does the choice of index spec impact other
>> functions (histograms)?
>>
>> Thanks,
>> Andrew

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
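
For anyone comparing the two query forms in this thread, here is a minimal sketch in plain Python (not the FastBit API). The feature tuples, function names, and the bin width default are made up for illustration. It shows how the overlap condition (end > i_start AND start < i_end) differs from the start-in-window condition (i_start < start < i_end), and which equi-width bins of 1000 positions a window would touch at a coarse top level. Whether a real two-level index helps in practice is not something this sketch can show.

```python
# Illustrative only: plain Python, not FastBit.
# Each feature is a hypothetical (dna, start, end) tuple.
features = [
    (1, 50, 150),     # starts before the window, overlaps it
    (1, 1200, 1300),  # entirely inside the window
    (1, 2900, 3100),  # starts inside, ends after the window
    (1, 5000, 5100),  # entirely outside the window
    (2, 1200, 1300),  # same coordinates, different dna
]


def overlaps(dna_id, i_start, i_end):
    """Features on dna_id that overlap the window (original query)."""
    return [f for f in features
            if f[0] == dna_id and f[2] > i_start and f[1] < i_end]


def starts_within(dna_id, i_start, i_end):
    """Features on dna_id whose start lies strictly inside the window
    (follow-up query; only the start column is tested)."""
    return [f for f in features
            if f[0] == dna_id and i_start < f[1] < i_end]


def coarse_bins(i_start, i_end, width=1000):
    """Bin numbers of an assumed equi-width top level that can hold an
    integer start value with i_start < start < i_end."""
    first = (i_start + 1) // width
    last = (i_end - 1) // width
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(overlaps(1, 100, 3000))
    print(starts_within(1, 100, 3000))
    print(coarse_bins(100, 3000))
```

The follow-up query drops the feature that starts before the window, which is the trade-off described above: it is acceptable only when features are short relative to the window.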
