When defining the IndexSpecification for your table, you can pass your own
implementation of
org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator. This lets you
control how the row keys are generated for the secondary index table. For
example, you could append the original table's row key to the indexed value,
so that every index row is unique and refers to exactly one original row.

When you create an indexed scanner, the secondary index code opens and wraps
a scanner on the secondary index table, starting at the start row you specify
(the indexed value you're looking up). It applies any filter you pass to the
rows of the secondary index table, so make sure anything you want to filter
on is listed in the "indexed columns" of your IndexSpecification.

For each row returned by the wrapped scanner, the client code then does a get
for the original table record (the original row key is stored in the
"__INDEX__" column family, I think). So in total, when using secondary
indexes, you wind up with 1 scan + N gets to look at N rows.

At least, this was my understanding of how things worked as of 0.19. I'm
actually moving indexing into my app layer as I update to 0.20.

Hope this helps.

--gh

On Mon, Aug 17, 2009 at 1:00 PM, Jonathan Gray<[email protected]> wrote:
> I'm actually unsure about that. Look at the code or experiment.
>
> Seems to me that there would be a uniqueness requirement, otherwise what do
> you expect the behavior to be? A get can only return a single row, so
> multiple index hits doesn't really make sense.
>
> Clint? You out there? :)
>
> JG
>
> bharath vissapragada wrote:
>>
>> I got it ... I think this is definitely useful in my app because I am
>> performing a full table scan every time for selecting the rowkeys based on
>> some column values.
>>
>> BUT ..
>>
>> we can have more than one rowkey for the same column value. Can you please
>> tell me how they are stored?
>>
>> Thanks in advance
>>
>> On Mon, Aug 17, 2009 at 9:27 PM, Jonathan Gray <[email protected]> wrote:
>>
>>> It's not an actual hash or btree index, but rather secondary indexes in
>>> HBase are implemented by creating an additional HBase table.
>>>
>>> If I have a table "users" (row key is userid) with family "data" and column
>>> "email", and I want to index the value in that column...
>>>
>>> I can create a table "users_email" where the row key is the email address
>>> (value from the column in "users" table) and a single column that contains
>>> the userid.
>>>
>>> Doing an "index lookup" would mean doing a get on "users_email" and then
>>> using that userid to do a lookup on the "users" table.
>>>
>>> IndexedTable does this transparently, but still does require two queries.
>>> So it's slower than a single query, but certainly faster than a full table
>>> scan.
>>>
>>> If you need hash-level performance on the index lookup, there are lots of
>>> solutions outside of HBase that would work... In-memory Java HashMap, Tokyo
>>> Cabinet on-disk HashMaps, BerkeleyDB, etc... If you need full-text indexing,
>>> you can use Lucene or the like.
>>>
>>> Make sense?
>>>
>>> JG
>>>
>>>
>>> bharath vissapragada wrote:
>>>
>>>> But I have read somewhere that secondary indexes are somewhat slow compared
>>>> to normal HBase tables. Does that affect the performance?
>>>>
>>>> Also, do you know the type of index created on the column (I mean hash type
>>>> or btree etc.)?
>>>>
>>>> On Mon, Aug 17, 2009 at 8:30 PM, Kirill Shabunov <[email protected]> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> As far as I understand you are talking about the secondary indexes. Yes,
>>>>> they can be used to quickly get the rowkey by a value in the indexed
>>>>> column.
>>>>>
>>>>> --Kirill
>>>>>
>>>>>
>>>>> bharath vissapragada wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have gone through the IndexedTableAdmin classes in the HBase 0.19.3 API.
>>>>>> I have seen some methods used to create an indexed table (on some
>>>>>> column). I have some doubts regarding the same ...
>>>>>>
>>>>>> 1) Are these somewhat similar to hash indexes (in an RDBMS), where I can
>>>>>> easily look up a column value and find its corresponding rowkey(s)?
>>>>>> 2) Can I find any performance gain when I use IndexedTable to search for a
>>>>>> particular column value, instead of scanning an entire normal HTable?
>>>>>>
>>>>>> Kindly clarify my doubts
>>>>>>
>>>>>> Thanks in advance
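
For anyone following the thread, here is a rough sketch of the "1 scan + N gets" pattern described in the reply at the top. It does not use the real tableindexed API at all: two in-memory TreeMaps stand in for the "users" table and its index table, and the class name, the separator character, and the helper methods are all made up for illustration. The index row key is the indexed value plus the original row key, which is one way to keep index rows unique when several rows share the same value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: TreeMaps stand in for HBase tables (both keep row keys sorted).
class SecondaryIndexSketch {
    // Hypothetical separator between the indexed value and the original row key.
    static final char SEP = '\u0000';

    final TreeMap<String, String> users = new TreeMap<>();        // userid -> email
    final TreeMap<String, String> usersByEmail = new TreeMap<>(); // index key -> userid

    int gets = 0; // counts lookups against the original table

    // Assumed key scheme: indexed value + separator + original row key.
    static String indexKey(String email, String userId) {
        return email + SEP + userId;
    }

    // Write the row and its index entry together.
    void put(String userId, String email) {
        users.put(userId, email);
        usersByEmail.put(indexKey(email, userId), userId);
    }

    // One scan over the index starting at the indexed value, then one get per hit.
    List<String> lookup(String email) {
        List<String> rowKeys = new ArrayList<>();
        String start = email + SEP;
        for (Map.Entry<String, String> e : usersByEmail.tailMap(start, true).entrySet()) {
            if (!e.getKey().startsWith(start)) {
                break; // scanned past the last index row for this value
            }
            String userId = e.getValue();
            gets++; // the "get" against the original table
            if (users.get(userId) != null) {
                rowKeys.add(userId);
            }
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch s = new SecondaryIndexSketch();
        s.put("u1", "a@example.com");
        s.put("u2", "b@example.com");
        s.put("u3", "a@example.com");
        System.out.println(s.lookup("a@example.com") + " found with " + s.gets + " gets");
    }
}
```

Because the original row key is part of the index key, two users with the same email produce two distinct index rows, and a scan starting at the email value visits both before it runs past the prefix.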
