Hi, Andrew,

You can place the index specification in -part.txt. For example,

BEGIN Column
name=end
data_type=Unsigned
index=<binning none/><encoding binary/>
END Column

The problem with interval encoding (<encoding interval/>) is that it produces very large indexes. FastBit will probably crash if you attempt to build an interval encoding index without binning or with a fairly large number of bins. To make it work, you will have to bin the data very coarsely, say, no more than 100 bins.

The better way to use interval encoding is to use it in the coarse level of a two-level index. In this case, FastBit uses a formula to determine how many coarse bins to use, so that the interval encoded bitmaps do not take up too much space.

However, even with two levels, the columns start and end might have too many distinct values to build a precise index. Let C denote the number of distinct values and N denote the number of rows in the data set. If C > N / 10, then there are too many distinct values for a precise/unbinned index (unless of course N is really small).

Since we have to use binning, the most important consideration is whether the bin boundaries match the expected query boundaries. In the previous example, i_begin and i_end are query boundaries. If the query boundaries do not match the bin boundaries, then the binned index cannot accurately answer the query and has to go back to the base data. For example, suppose you have the following specification for start (assuming the minimum and the maximum are accurate):

BEGIN Column
name=start
data_type=Unsigned
minimum=0
maximum=100000000
index=<binning begin=0 end=1e8 nbins=10000/><encoding interval-equality/>
END Column

(Note: the index specification must be on one line in -part.txt.)

Each bin here is 1e8 / 10000 = 10,000 wide, so this index can efficiently handle i_begin and i_end that are multiples of 10,000. If you have "2000 < start < 78000" in a query, then the index is insufficient to resolve this range condition, and FastBit will have to go back to the raw data.
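To make the two rules of thumb above concrete, here is a small Python sketch. It is not part of FastBit, and the function names are made up for illustration. It assumes equal-width bins over [begin, end) with integer arguments, and it ignores details such as open versus closed range endpoints, so treat it as a rough model of when candidate checking is needed rather than a description of what FastBit actually does internally.

```python
def bin_width(begin, end, nbins):
    """Width of one bin when [begin, end) is split into nbins equal bins."""
    return (end - begin) / nbins


def on_bin_boundary(value, begin, end, nbins):
    """True if value falls exactly on an equal-width bin boundary.

    Integer arithmetic avoids floating-point round-off; all arguments
    are assumed to be integers.
    """
    return ((value - begin) * nbins) % (end - begin) == 0


def needs_candidate_check(lo, hi, begin, end, nbins):
    """True if either endpoint of lo < x < hi lies inside a bin, so the
    binned index alone could not resolve the condition (simplified model)."""
    return not (on_bin_boundary(lo, begin, end, nbins)
                and on_bin_boundary(hi, begin, end, nbins))


def too_many_distinct_values(c, n):
    """Rule of thumb from the message: C > N / 10 suggests using binning."""
    return c > n / 10


if __name__ == "__main__":
    begin, end = 0, 10**8
    # Compare the two nbins settings for the query 2000 < start < 78000.
    for nbins in (10_000, 100_000):
        print(nbins, bin_width(begin, end, nbins),
              needs_candidate_check(2000, 78000, begin, end, nbins))
```

Running it for the query "2000 < start < 78000" with nbins=10000 and nbins=100000 shows how the bin width relates to the query boundaries in each case.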
In the above example, the values 2000 and 78000 are multiples of 1000; therefore, you can increase nbins to 100000 (a bin width of 1000) and keep the bin boundaries at the same resolution as the query boundaries.

Another way to think about the query boundaries is that both 2000 (2e3) and 78000 (7.8e4) have no more than 2 significant digits. Instead of using equal-width binning, you can use the following index specification:

index=<binning precision=2/><encoding interval-equality/>

If you plan to keep the number of bins relatively small, there is a way to improve the speed of candidate checking -- the process of accessing the base data to resolve those records that cannot be resolved by the binned index. In the binning option, add the keyword "reorder", for example:

index=<binning precision=1 reorder/><encoding interval-equality/>

The option "reorder" tells the index building function to produce a clustered version of the base data (OrBiC: Order-preserving Bin-based Clustering). When candidate checking is necessary, it can use this clustered data to speed up the process.

Oh, well, the message is getting quite long. Hope you find useful information here. Feel free to follow up with more questions.

John

On 9/15/2009 11:08 AM, Andrew Olson wrote:
> John,
> Thanks for the tips. Can this be specified in the -part.txt? I have
> one follow up question. Suppose that the features mapped to the dna
> are short enough that you don't care about features that start before
> the query window or end after the query window. The query would now be:
> SELECT col1,col2,col3 FROM table WHERE dna=dna_id AND i_start < start
> < i_end
>
> In this case, would it make sense for the start column to have an
> interval encoded index? Also, I read Rishi Sinha's paper on
> multi-level indexes and it seems like a 2 level index could help if
> the top level is equi-width bins of 1000 positions.
>
> Andrew
>

_______________________________________________
FastBit-users mailing list
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
