Hi Andrew,

You are right that the dna column should be indexed with equality encoding and without binning. If you have not specified anything else, FastBit should be able to determine this on its own; this is the easy case. If you want to make it explicit, the index specification can be

    index=<binning none/><encoding equality/>

Both start and end are integers with a very large number of distinct values. Furthermore, the query boundaries (i_start and i_end) can be arbitrary integers as well. In this case, your best bet might be

    index=<binning none/><encoding binary/>

which produces a bit-sliced index (also called a binary-encoded index). Since it is a precise index, it can answer queries with arbitrary boundaries without scanning the base data, so it should be pretty fast.

Alternatively, if your query boundaries are typically of the form 2.2e5 (220,000), i.e., with relatively few significant digits (2 in this particular example), then it is possible to bin start and end with a sufficient number of significant digits to avoid the need to access the base data. Say your query boundaries never use more than three significant digits for the mantissa (the exponent can be arbitrary); then you might consider

    index=<binning precision=3/><encoding integer-equality/>

There are a few other cases spelled out in <http://crd.lbl.gov/~kewu/fastbit/doc/indexSpec.html> as well. Let us know if you have additional questions.

John

On 9/15/2009 10:12 AM, Andrew Olson wrote:
> Hi John,
> I am interested in tuning my FastBit indexes to optimize query
> performance while keeping the file sizes from growing too much. The
> most common query that my users pose extracts features that overlap
> with a specific genomic interval. It uses these three columns:
>
> dna - low cardinality (<100)
> start - high cardinality (min = ~1 max = ~250 million)
> end - high cardinality (min =~ 1 max = ~250 million)
>
> SELECT col1,col2,col3 FROM table WHERE dna = dna_id AND end > i_start
> AND start < i_end
>
> The size of the query interval (i_end - i_start) varies from 20 to
> 200,000.
> From my reading of the FastBit literature, it looks like the dna
> column should be equality encoded and the other 2 columns should be
> binned and range encoded, but I could be mistaken.
>
> 1. What are the default index specs for these columns?
> 2. Which other options should I try?
> 3. More generally, does the choice of index spec impact other
> functions (histograms)?
>
> Thanks,
> Andrew
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
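
Addendum: to make the reply's point about the binary (bit-sliced) encoding more concrete, here is a rough Python sketch of how such an index could answer "end > i_start AND start < i_end" for arbitrary integer boundaries using only bitmap operations, without reading the column values back. This is not FastBit code and does not use FastBit's API. The function names, the use of plain Python integers as uncompressed bitmaps, and the five sample rows are all assumptions made for illustration; it also assumes non-negative integer values.

```python
# Illustrative sketch only: uncompressed bitmaps stored in Python ints.
# Bit i of a bitmap corresponds to row i. Not FastBit's implementation.

def build_equality_index(values):
    """Equality encoding: one bitmap per distinct value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def build_bit_slices(values):
    """Binary encoding: slices[j] marks the rows whose value has bit j set."""
    nbits = max(max(values).bit_length(), 1)
    slices = [0] * nbits
    for row, v in enumerate(values):
        for j in range(nbits):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices

def rows_less_than(slices, n_rows, c):
    """Bitmap of rows with value < c, scanning slices from the high bit down."""
    all_rows = (1 << n_rows) - 1
    if c <= 0:
        return 0
    if c.bit_length() > len(slices):
        return all_rows  # c is larger than any indexed value
    lt, eq = 0, all_rows  # eq: rows still equal to c on the bits seen so far
    for j in reversed(range(len(slices))):
        if (c >> j) & 1:
            lt |= eq & ~slices[j]
            eq &= slices[j]
        else:
            eq &= ~slices[j]
    return lt

def rows_greater_than(slices, n_rows, c):
    """Bitmap of rows with value > c, scanning slices from the high bit down."""
    all_rows = (1 << n_rows) - 1
    if c < 0:
        return all_rows
    if c.bit_length() > len(slices):
        return 0  # c is larger than any indexed value
    gt, eq = 0, all_rows
    for j in reversed(range(len(slices))):
        if (c >> j) & 1:
            eq &= slices[j]
        else:
            gt |= eq & slices[j]
            eq &= ~slices[j]
    return gt

def overlap_rows(dna_index, start_slices, end_slices, n_rows,
                 dna_id, i_start, i_end):
    """Rows satisfying: dna = dna_id AND end > i_start AND start < i_end."""
    hits = (dna_index.get(dna_id, 0)
            & rows_greater_than(end_slices, n_rows, i_start)
            & rows_less_than(start_slices, n_rows, i_end))
    return [r for r in range(n_rows) if (hits >> r) & 1]

# Hypothetical sample rows (not from the original thread).
dna = [1, 1, 2, 1, 2]
start = [100, 5000, 100, 220000, 150]
end = [200, 5200, 300, 221000, 400]

dna_index = build_equality_index(dna)
start_slices = build_bit_slices(start)
end_slices = build_bit_slices(end)

hits = overlap_rows(dna_index, start_slices, end_slices, len(dna),
                    dna_id=1, i_start=150, i_end=5100)
print(hits)  # [0, 1]
```

The number of bitmaps grows with the number of bits in the largest value (18 for the sample data, 28 for values up to about 250 million) rather than with the number of distinct values, which is why this encoding suits high-cardinality columns such as start and end. A real index would also compress the bitmaps, which this sketch does not attempt.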
