Hi, Andrew,

You can place the index specification in -part.txt.  For example,

BEGIN Column
name=end
data_type=Unsigned
index=<binning none/><encoding binary/>
END Column

The problem with interval encoding (<encoding interval/>) is that it 
produces very large indexes.  FastBit will probably crash if you 
attempt to build an interval-encoded index (without binning, or with a 
fairly large number of bins).  To make it work, you will have to bin 
the data very coarsely, say, with no more than 100 bins.
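
For instance, a coarsely binned interval index for the end column 
might look like the following in -part.txt.  This is just a sketch 
using the 100-bin rule of thumb above, and it assumes FastBit derives 
the bin range from the column's actual minimum and maximum when no 
begin/end values are given:

BEGIN Column
name=end
data_type=Unsigned
index=<binning nbins=100/><encoding interval/>
END Column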

A better way to use interval encoding is in the coarse 
level of a two-level index.  In this case, FastBit has a formula to 
determine how many coarse bins to use so that the interval-encoded 
bitmaps do not take up too much space.  However, even with two 
levels, the columns start and end might have too many distinct values 
to build a precise index.  Let C denote the number of distinct values 
and N denote the number of rows in the data set.  If C > N / 10, then 
there are too many distinct values for a precise/unbinned index 
(unless, of course, N is really small).
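
The C > N / 10 rule of thumb is easy to check before deciding on an 
index.  The function name and the sample numbers below are my own 
illustration; only the threshold itself comes from the discussion 
above:

```python
def precise_index_ok(distinct_values, num_rows, ratio=0.1):
    """Rule of thumb: a precise (unbinned) index is reasonable only
    when the number of distinct values C is at most N * ratio."""
    return distinct_values <= num_rows * ratio

# A start/end column with 5 million distinct positions in a
# 10-million-row table has far too many distinct values:
print(precise_index_ok(5_000_000, 10_000_000))  # False

# A low-cardinality column (say, 25 chromosome names) is fine:
print(precise_index_ok(25, 10_000_000))  # True
```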

Since we have to use binning, the most important consideration is 
whether the bin boundaries match the expected query boundaries.  In 
the previous example, i_begin and i_end are query boundaries.  If 
the query boundaries do not match the bin boundaries, the binned index 
cannot answer the query exactly on its own; it has to fall back on the 
base data.  For example, suppose you have the following specification 
for start (assuming the minimum and the maximum are accurate)

BEGIN Column
name=start
data_type=Unsigned
minimum=0
maximum=100000000
index=<binning begin=0 end=1e8 nbins=10000/><encoding interval-equality/>
END Column

(Note: the index specification must be on one line in -part.txt.)

This index can handle i_begin and i_end values that are multiples of 
10,000 efficiently.  If a query contains "2000 < start < 78000", the 
index cannot fully resolve this range condition, and FastBit will 
have to go back to the raw data.

In the above example, the values 2000 and 78000 are multiples of 1000; 
therefore, you can increase nbins to 100000 to keep the bin boundaries 
at the same resolution as the query boundaries.
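
With that change, the entry in -part.txt would read the same as 
before except for nbins (bin width is now (1e8 - 0) / 100000 = 1000, 
matching the resolution of the query boundaries):

BEGIN Column
name=start
data_type=Unsigned
minimum=0
maximum=100000000
index=<binning begin=0 end=1e8 nbins=100000/><encoding interval-equality/>
END Column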

Another way to think about the query boundaries is that both 2000 
(2e3) and 78000 (7.8e4) have no more than 2 significant digits. 
Instead of using equal-width binning, you can use the following index 
specification

index=<binning precision=2/><encoding interval-equality/>

If you plan to keep the number of bins relatively small, there is a 
way to improve the speed of candidate checking -- the process of 
accessing the base data to resolve those records that cannot be 
resolved by the binned index.  In the binning option, add the keyword 
"reorder", for example

index=<binning precision=1 reorder/><encoding interval-equality/>

The option "reorder" tells the index building function to produce a 
clustered version of the base data (OrBiC: Order-preserving Bin-based 
Clustering).  When candidate checking is necessary, FastBit can use 
this clustered copy of the data to speed up the process.

Oh, well, the message is getting quite long.  I hope you find some 
useful information here.  Feel free to follow up with more questions.

John



On 9/15/2009 11:08 AM, Andrew Olson wrote:
> John,
> Thanks for the tips.  Can this be specified in the -part.txt?  I have  
> one follow up question.  Suppose that the features mapped to the dna  
> are short enough that you don't care about features that start before  
> the query window or end after the query window.  The query would now be:
> SELECT col1,col2,col3 FROM table WHERE dna=dna_id AND i_start < start  
> < i_end
> 
> In this case, would it make sense for the start column to have an  
> interval encoded index?  Also, I read Rishi Sinha's paper on multi- 
> level indexes and it seems like a 2 level index could help if the top  
> level is equi-width bins of 1000 positions.
> 
> Andrew
> 
> 
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
