Hi, Andrew,

You are right that DNA should be indexed with equality encoding and 
without binning.  If you have not specified anything else, FastBit 
should be able to determine this one -- this is the easy case.  If you 
want to make it very explicit, the index specification can be

index=<binning none/><encoding equality/>
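
To illustrate why equality encoding suits a low-cardinality column such as dna, here is a toy Python sketch. It is not FastBit code: the function names and the use of Python integers as uncompressed bitmaps are assumptions made purely for illustration. The idea is one bitmap per distinct value, so a condition like dna = dna_id reads a single bitmap.

```python
from collections import defaultdict


def build_equality_index(values):
    """Return {value: bitmap}; bit i of a bitmap is set when values[i] == value."""
    index = defaultdict(int)
    for row, v in enumerate(values):
        index[v] |= 1 << row
    return dict(index)


def rows_of(bitmap):
    """List the row numbers whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical dna column with three distinct values.
dna = [3, 1, 3, 2, 1, 3]
eq_index = build_equality_index(dna)

# "dna = 3" is answered by looking up exactly one bitmap.
hits = rows_of(eq_index.get(3, 0))
print(hits)
```

With fewer than 100 distinct values, the index holds fewer than 100 bitmaps, which is why no binning is needed here.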

Both start and end are integers that have a very large number of 
distinct values.  Furthermore, the query boundaries (i_start and 
i_end) can be arbitrary integers as well.  In this case, your best bet 
might be

index=<binning none/><encoding binary/>

which produces a bit-sliced index (or binary encoded index).  Since it 
is a precise index, it can answer queries with arbitrary query 
boundaries without the need to scan the base data.  It should be 
pretty fast.
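
As a rough sketch of how a bit-sliced index can answer range conditions with arbitrary boundaries, the toy Python below keeps one bitmap per binary digit of the value and evaluates "value < c" from those bitmaps alone. It is an illustration under stated assumptions, not FastBit's implementation: the helper names are hypothetical, the bitmaps are plain Python integers without compression, and the values are assumed to be non-negative integers that fit in nbits bits.

```python
def build_bit_slices(values, nbits):
    """Slice j is a bitmap whose bit i is set when bit j of values[i] is 1."""
    slices = [0] * nbits
    for row, v in enumerate(values):
        for j in range(nbits):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices


def less_than(slices, nrows, c):
    """Bitmap of rows with value < c, computed from the slices only."""
    all_rows = (1 << nrows) - 1
    if c <= 0:
        return 0
    if c >= (1 << len(slices)):
        return all_rows
    lt, eq = 0, all_rows  # eq: rows still equal to c on the bits seen so far
    for j in reversed(range(len(slices))):
        if (c >> j) & 1:
            lt |= eq & ~slices[j]  # row has 0 where c has 1: row is smaller
            eq &= slices[j]
        else:
            eq &= ~slices[j]  # row has 1 where c has 0: row is larger
    return lt


def greater_than(slices, nrows, c):
    """Bitmap of rows with value > c, i.e. not (value < c + 1)."""
    all_rows = (1 << nrows) - 1
    return all_rows & ~less_than(slices, nrows, c + 1)


# Hypothetical features and query interval.
start = [100, 250, 400, 900]
end = [200, 300, 950, 1000]
i_start, i_end = 260, 420

nrows, nbits = len(start), 10  # 10 bits cover values up to 1023
start_slices = build_bit_slices(start, nbits)
end_slices = build_bit_slices(end, nbits)

# end > i_start AND start < i_end
overlap = greater_than(end_slices, nrows, i_start) & less_than(start_slices, nrows, i_end)
print(bin(overlap))
```

The number of bitmaps grows with the number of bits in the value (about 28 for values up to 250 million) rather than with the number of distinct values, which is what keeps this kind of index compact for high-cardinality columns.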

Alternatively, if your query boundaries are typically of the form 
2.2e5 (220,000), i.e., with relatively few significant digits (2 in 
this particular example), then it is possible to bin start and end 
with a sufficient number of significant digits to avoid the need to 
access the base data.  Say your query boundaries never use more than 
three significant digits in the mantissa (with arbitrary 
exponents), then you might consider

index=<binning precision=3/><encoding integer-equality/>
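
To make the significant-digit argument concrete, here is a small Python sketch. It assumes, for illustration only, that the bin boundaries are exactly the integers with at most `precision` significant digits; the function names are hypothetical and this is not how FastBit is implemented. A query boundary that coincides with a bin boundary never splits a bin, so the bins alone decide which rows qualify.

```python
def bin_lower_bound(value, precision=3):
    """Round a non-negative integer down to `precision` significant digits."""
    ndigits = len(str(value))
    scale = 10 ** max(ndigits - precision, 0)
    return (value // scale) * scale


def on_bin_boundary(boundary, precision=3):
    """True when the query boundary has at most `precision` significant digits."""
    return bin_lower_bound(boundary, precision) == boundary


# Hypothetical query boundaries.
for boundary in (220000, 1230000, 20, 123456):
    if on_bin_boundary(boundary):
        print(boundary, "falls on a bin boundary: answerable from the index")
    else:
        print(boundary, "splits the bin starting at", bin_lower_bound(boundary),
              ": rows in that bin need a check against the base data")
```

If the boundaries routinely look like 123456, this binned index would have to fall back to the base data for the edge bins, and the unbinned binary encoding is the safer choice.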

There are a few other cases spelled out in 
<http://crd.lbl.gov/~kewu/fastbit/doc/indexSpec.html> as well.  Let us 
know if you have additional questions.

John


On 9/15/2009 10:12 AM, Andrew Olson wrote:
> Hi John,
> I am interested in tuning my FastBit indexes to optimize query  
> performance while keeping the file sizes from growing too much.  The  
> most common query that my users pose extracts features that overlap  
> with a specific genomic interval.  It uses these three columns:
> 
> dna - low cardinality (<100)
> start - high cardinality (min = ~1 max = ~250 million)
> end - high cardinality (min = ~1 max = ~250 million)
> 
> SELECT col1,col2,col3 FROM table WHERE dna = dna_id AND end > i_start  
> AND start < i_end
> 
> The size of the query interval (i_end - i_start) varies from 20 to  
> 200,000.
> From my reading of the FastBit literature, it looks like the dna  
> column should be equality encoded and the other 2 columns should be  
> binned and range encoded, but I could be mistaken.
> 
> 1. What are the default index specs for these columns?
> 2. Which other options should I try?
> 3. More generally, does the choice of index spec impact other  
> functions (histograms)?
> 
> Thanks,
> Andrew
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
