All,

We want to put secondary indexes into hbase. The motivation is that we are storing data in hbase that we want to serve to users, and we would like to be able to serve rows sorted by column values. Our queries will be over rows with a given key prefix, so we should not be hitting too many regions.
I was thinking it would work roughly like this:

- At table creation time, individual columns can be declared as indexed. By default we could sort the column values lexicographically, or we can provide a WritableComparatorFactory<T> which can build values of type T from a byte[] as well as provide a Comparator<T>. (Better than providing a Comparator<byte[]>, since deserialization then costs once per row insert rather than twice on each comparison.)
- We catch all writes/deletes and maintain a SortedMap<T, HStoreKey> which keeps the column values in order and maps them back to row keys. A first cut might just keep all of this in memory, but it should be backed with MapFile(s).
- Add to the hregion the ability to scan through keys in column order: just iterate through the SortedMap, run a filter on the key, and if it passes do a get on the row.
- Add a ColumnOrderedClientScanner which opens column-order scanners to all applicable hregions and repeatedly picks the row with the lowest column value from among them.
- Region splits should be easy enough, just a scan through the SortedMap to partition.

Of course, the index could also be used for more efficient querying on the indexed column's values. Some rough sketches of what I have in mind follow below.
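To make the comparator-factory idea concrete, here is a minimal sketch of what I mean. The interface name comes from the proposal above; nothing like it exists in hbase today, and the long-valued example implementation is just a placeholder:

import java.util.Comparator;

/**
 * Sketch of the per-column factory: turns the stored byte[] into a T
 * once per row insert and supplies the Comparator<T> used by the index.
 */
interface WritableComparatorFactory<T> {
  T fromBytes(byte[] value);
  Comparator<T> getComparator();
}

/** Example: a column holding big-endian encoded longs. */
class LongComparatorFactory implements WritableComparatorFactory<Long> {
  public Long fromBytes(byte[] value) {
    long v = 0;
    for (int i = 0; i < value.length; i++) {
      v = (v << 8) | (value[i] & 0xff);
    }
    return Long.valueOf(v);
  }

  public Comparator<Long> getComparator() {
    return new Comparator<Long>() {
      public int compare(Long a, Long b) {
        return a.compareTo(b);
      }
    };
  }
}

The point is that fromBytes() runs once per insert, and the comparator then works on the already-deserialized values.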
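And a very rough first cut of the per-region index itself, kept entirely in memory (a real version would be backed by MapFiles). It just wraps a TreeMap built from the factory's comparator; K stands in for HStoreKey so the snippet is self-contained:

import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/**
 * Rough in-memory index for one indexed column of one region.
 * K stands in for whatever points back at the row (HStoreKey in hbase).
 */
class ColumnIndex<T, K> {
  private final TreeMap<T, K> index;
  private final WritableComparatorFactory<T> factory;

  ColumnIndex(WritableComparatorFactory<T> factory) {
    this.factory = factory;
    this.index = new TreeMap<T, K>(factory.getComparator());
  }

  /** Called on every write to the indexed column. */
  void onPut(K rowKey, byte[] columnValue) {
    // Deserialize once per insert; note that duplicate column values
    // clobber each other here, which a real version would have to handle.
    index.put(factory.fromBytes(columnValue), rowKey);
  }

  /** Called on every delete of the indexed column. */
  void onDelete(byte[] columnValue) {
    index.remove(factory.fromBytes(columnValue));
  }

  /**
   * Column-order scan: walk the sorted map, let the caller run its
   * filter on each entry and do a get on the row if it passes.
   */
  Iterator<Map.Entry<T, K>> columnOrderIterator() {
    return index.entrySet().iterator();
  }
}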
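Finally, a sketch of the client-side merge. One simple way to keep picking the lowest column value across regions is a priority queue over the open per-region scanners. RegionScanner and its two methods are made-up placeholders here, not an existing hbase interface:

import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/**
 * Sketch of the merge done by a ColumnOrderedClientScanner: every
 * region returns rows already sorted by column value, and the client
 * repeatedly takes the scanner whose next value is lowest.
 */
interface RegionScanner<T> {
  /** Column value of the next row, or null if this region is exhausted. */
  T peekValue();
  /** Row key of the next row; advances the scanner. */
  byte[] nextRow();
}

class ColumnOrderedClientScanner<T> {
  private final PriorityQueue<RegionScanner<T>> heap;

  ColumnOrderedClientScanner(List<RegionScanner<T>> scanners, final Comparator<T> cmp) {
    heap = new PriorityQueue<RegionScanner<T>>(Math.max(1, scanners.size()),
        new Comparator<RegionScanner<T>>() {
          public int compare(RegionScanner<T> a, RegionScanner<T> b) {
            return cmp.compare(a.peekValue(), b.peekValue());
          }
        });
    for (RegionScanner<T> s : scanners) {
      if (s.peekValue() != null) {
        heap.add(s);
      }
    }
  }

  /** Next row key in global column-value order, or null when done. */
  byte[] next() {
    RegionScanner<T> lowest = heap.poll();
    if (lowest == null) {
      return null;
    }
    byte[] row = lowest.nextRow();
    if (lowest.peekValue() != null) {
      heap.add(lowest); // re-insert under its new head value
    }
    return row;
  }
}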
Do other users have a need for this functionality? What do developers think about this? I know hbase is more intended for back-end batch-style processing, but we have this need.

Cheers,
-clint