The proposal sounds interesting. Will the indexes be maintained in the same "transaction" as the update? That is, if an update to a row succeeds but index maintenance fails, would you roll back the row update?
Are you also considering composite (multi-column) indexes like those used in Google's App Engine?

We have been thinking about adding a Lucene-based text index to HBase, which can be maintained asynchronously with the table. Are you interested in text indexes too?

Jun

[EMAIL PROTECTED] wrote on 04/22/2008 10:31:02 AM:

> All,
>
> We want to put secondary indexes into hbase. The motivation is that we
> are storing data in hbase that we want to serve to users. We would
> like to be able to serve rows sorted by column values. Our queries
> will be over rows with a given key prefix, so we should not be hitting
> too many regions.
>
> I was thinking it would work roughly like this:
>
> - At table creation time, individual columns can be declared as
> indexed. By default we could sort the column values lexicographically,
> or we can provide a WritableComparatorFactory<T> which has the ability
> to make values of type T from a byte[], as well as providing a
> Comparator<T>. (Better than providing a Comparator<byte[]> as it only
> costs once per row insert for deserialization, rather than twice on
> each comparison).
>
> - We catch all writes/deletes and maintain a SortedMap<T, HStoreKey>
> which keeps the column values in order, and maps them back to row
> keys. First cut may just keep all this in memory, but it should be
> backed with MapFile(s).
>
> - Add to the hregion the ability to scan through keys in column order.
> Just iterate through the SortedMap, run a filter on the key, and if it
> passes do a get on the row.
>
> - Add a ColumnOrderedClientScanner which will open column order
> scanners to all applicable hregions, and continuously pick the row with
> the lowest column value from each of the client scanners.
>
> - Region splits should be easy enough, just a scan through the
> SortedMap to partition.
>
> Of course, the index could also be used for more efficient querying on
> the indexed column's values.
>
> Do other users have a need for this functionality?
>
> What do developers think about this? I know hbase is more intended for
> back-end batch style processing, but we have this need.
>
> Cheers,
> -clint
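To make the quoted design concrete, here is a rough in-memory sketch of the per-region sorted index and of the merging client scanner. It is only an illustration under assumptions, not HBase code: ColumnIndexSketch, RegionIndex, Entry, Cursor and mergedRowKeys are hypothetical names, row keys are plain Strings instead of HStoreKey, and each column value maps to a set of row keys (rather than a single key) so that rows sharing a value are not lost. Persistence in MapFiles, key filtering and region splits are left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: none of these names are HBase classes.
class ColumnIndexSketch {

    /** One (column value, row key) pair produced by an index scan. */
    static final class Entry<T> {
        final T value;
        final String rowKey;
        Entry(T value, String rowKey) { this.value = value; this.rowKey = rowKey; }
    }

    /** Per-region index: column values in comparator order, mapped back to row keys. */
    static final class RegionIndex<T> {
        private final TreeMap<T, TreeSet<String>> byValue;
        private final Map<String, T> byRow = new HashMap<>();

        RegionIndex(Comparator<T> order) { this.byValue = new TreeMap<>(order); }

        /** Called on every write to the indexed column (values assumed non-null). */
        void put(String rowKey, T value) {
            remove(rowKey); // drop the stale entry if the row is being overwritten
            byRow.put(rowKey, value);
            byValue.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
        }

        /** Called on every delete of the row or of the indexed column. */
        void remove(String rowKey) {
            T old = byRow.remove(rowKey);
            if (old == null) return; // row was not indexed
            TreeSet<String> rows = byValue.get(old);
            rows.remove(rowKey);
            if (rows.isEmpty()) byValue.remove(old);
        }

        /** Scan the region in column-value order (snapshot of the current index). */
        Iterator<Entry<T>> scan() {
            List<Entry<T>> out = new ArrayList<>();
            for (Map.Entry<T, TreeSet<String>> e : byValue.entrySet()) {
                for (String row : e.getValue()) out.add(new Entry<>(e.getKey(), row));
            }
            return out.iterator();
        }
    }

    /** Heap element: the current head of one region scan plus the rest of that scan. */
    static final class Cursor<T> {
        Entry<T> head;
        final Iterator<Entry<T>> rest;
        Cursor(Iterator<Entry<T>> rest) { this.rest = rest; this.head = rest.next(); }
    }

    /** Client-side merge: repeatedly emit the row with the lowest column value. */
    static <T> List<String> mergedRowKeys(List<RegionIndex<T>> regions, Comparator<T> order) {
        PriorityQueue<Cursor<T>> heap =
            new PriorityQueue<>((a, b) -> order.compare(a.head.value, b.head.value));
        for (RegionIndex<T> region : regions) {
            Iterator<Entry<T>> it = region.scan();
            if (it.hasNext()) heap.add(new Cursor<>(it));
        }
        List<String> rows = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();
            rows.add(c.head.rowKey);
            if (c.rest.hasNext()) { c.head = c.rest.next(); heap.add(c); }
        }
        return rows;
    }
}
```

The merge keeps one cursor per region in a heap, so each emitted row costs O(log R) for R regions. The order in which rows with equal column values from different regions come out is not defined in this sketch.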
