Alternate indexed hbase implementation; speeds scans by adding indexes to 
regions rather than secondary tables
---------------------------------------------------------------------------------------------------------

                 Key: HBASE-2037
                 URL: https://issues.apache.org/jira/browse/HBASE-2037
             Project: Hadoop HBase
          Issue Type: New Feature
            Reporter: stack
             Fix For: 0.20.3


Purpose

The goal of the indexed HBase contrib is to speed up scans by indexing HBase 
columns. Indexed HBase (IHBase) differs from the indexed tables in 
transactional HBase (ITHBase): the indexes in ITHBase are themselves HBase 
tables that use the indexed column's values as row keys, whereas IHBase creates 
its indexes at the region level. The differences are summarized below.

+ Global ordering
ITHBase: yes
IHBase: no
Comment: IHBase has an index for each region. The flip side of not having 
global ordering is compatibility with the good old HRegion: results come 
back in row order (and not in value order as in ITHBase)

+ Full table scan?
ITHBase: no
IHBase: no
Comment: ITHBase does a partial scan on the index table. IHBase supports 
specifying start/end rows to limit the number of scanned regions

+ Multiple index usage
ITHBase: no
IHBase: yes
Comment: IHBase can take advantage of multiple indexes in the same scan. The 
IHBase IdxScan object accepts an Expression which allows intersection/union of 
several indexed column criteria

+ Extra disk storage
ITHBase: yes
IHBase: no
Comment: IHbase indexes are created when the region starts/flushes and do not 
require any extra storage

+ Extra RAM
ITHBase: yes
IHBase: yes
Comment: IHBase indexes are held in memory and hence increase the memory 
overhead. ITHBase indexes increase the number of regions each region server has 
to support, which costs memory too

+ Parallel scanning support
ITHBase: no
IHBase: yes
Comment: In ITHBase the index table needs to be consulted and then GETs are 
issued for each matching row. The behavior of IHBase (as perceived by the 
client) is no different from a regular scan and hence supports parallel 
scanning seamlessly. Parallel GETs could be implemented to speed up ITHBase 
scans
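
The ordering difference in the comparison above can be illustrated with a 
small Java sketch. It is only a model: in-memory TreeMaps stand in for the 
ITHBase index table and for the per-region IHBase indexes, and every class and 
method name in it (IndexLayoutSketch, indexTableScan, regionIndexScan, the 
sample data) is hypothetical rather than part of the contrib. The explicit sort 
inside each region is an assumption standing in for the region returning its 
matches in row order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Illustrative model only: TreeMaps stand in for tables and regions.
class IndexLayoutSketch {

    // ITHBase-style: one index table keyed by column value, so a partial scan
    // of it yields row keys in value order. Each key would still need a GET
    // against the main table (the second roundtrip, not modelled here).
    static List<String> indexTableScan(TreeMap<Long, List<String>> indexTable,
                                       long lo, long hi) {
        List<String> rowKeys = new ArrayList<>();
        for (List<String> keys : indexTable.subMap(lo, true, hi, true).values()) {
            rowKeys.addAll(keys);
        }
        return rowKeys;
    }

    // IHBase-style: every region has its own index. Regions are visited in
    // row-key order and each region contributes its matches in row order.
    static List<String> regionIndexScan(List<TreeMap<Long, List<String>>> regionIndexes,
                                        long lo, long hi) {
        List<String> rowKeys = new ArrayList<>();
        for (TreeMap<Long, List<String>> regionIndex : regionIndexes) {
            List<String> hits = new ArrayList<>();
            for (List<String> keys : regionIndex.subMap(lo, true, hi, true).values()) {
                hits.addAll(keys);
            }
            Collections.sort(hits); // assumption: models row-ordered output
            rowKeys.addAll(hits);
        }
        return rowKeys;
    }

    // Sample data: rows r1..r4 hold the values 30, 10, 20, 10.
    static TreeMap<Long, List<String>> sampleIndexTable() {
        TreeMap<Long, List<String>> table = new TreeMap<>();
        table.put(10L, Arrays.asList("r2", "r4"));
        table.put(20L, Arrays.asList("r3"));
        table.put(30L, Arrays.asList("r1"));
        return table;
    }

    static List<TreeMap<Long, List<String>>> sampleRegionIndexes() {
        TreeMap<Long, List<String>> regionA = new TreeMap<>(); // holds r1, r2
        regionA.put(30L, Arrays.asList("r1"));
        regionA.put(10L, Arrays.asList("r2"));
        TreeMap<Long, List<String>> regionB = new TreeMap<>(); // holds r3, r4
        regionB.put(20L, Arrays.asList("r3"));
        regionB.put(10L, Arrays.asList("r4"));
        List<TreeMap<Long, List<String>>> regions = new ArrayList<>();
        regions.add(regionA);
        regions.add(regionB);
        return regions;
    }

    public static void main(String[] args) {
        System.out.println(indexTableScan(sampleIndexTable(), 10, 30));
        System.out.println(regionIndexScan(sampleRegionIndexes(), 10, 30));
    }
}
```

For the sample data and the value range 10 to 30, the index-table scan yields 
r2, r4, r3, r1 (value order) while the per-region scan yields r1, r2, r3, r4 
(row order), which is what a regular scan would return.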


Why IHBase should outperform ITHBase
1. More flexible:
   a. Supports range queries and multi-index queries
   b. Supports different types, not only byte arrays
2. Less overhead: ITHBase pays for at least two 'table roundtrips', one for the 
index table and the other for the main table
3. Quicker index expression evaluation: IHBase uses dedicated index data 
structures while ITHBase uses the regular HRegion scan facilities

Implementation notes
• Only Storefiles are indexed. Every index scan performs a full memstore scan. 
Indexing the memstore will be implemented only if scanning the memstore 
proves to be a performance bottleneck
• Index expression evaluation is performed using bit sets. There are two types 
of bitsets: compressed and expanded. An index will typically store a compressed 
bitset while an expression evaluator will most probably use an expanded bitset
+ TODO
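
The bitset-based evaluation described in the notes above can be sketched as 
follows. This is a minimal illustration, not the contrib's code: 
BitSetIndexSketch, ColumnIndex and their methods are hypothetical names, row 
ordinals within a region are assumed to be small integers, and 
java.util.BitSet plays the role of the expanded bitset (the compressed form is 
not modelled).

```java
import java.util.BitSet;
import java.util.TreeMap;

// Illustrative model only; none of these names come from the IHBase contrib.
class BitSetIndexSketch {

    // Hypothetical per-region index on one column:
    // column value -> bitset of the row ordinals that hold that value.
    static final class ColumnIndex {
        private final TreeMap<Long, BitSet> rowsByValue = new TreeMap<>();

        void add(int rowOrdinal, long value) {
            rowsByValue.computeIfAbsent(value, v -> new BitSet()).set(rowOrdinal);
        }

        // Rows whose value equals the argument (a copy, safe to modify).
        BitSet equalTo(long value) {
            BitSet rows = rowsByValue.get(value);
            return rows == null ? new BitSet() : (BitSet) rows.clone();
        }

        // Range query: rows whose value lies in [lo, hi], both inclusive.
        BitSet between(long lo, long hi) {
            BitSet result = new BitSet();
            for (BitSet rows : rowsByValue.subMap(lo, true, hi, true).values()) {
                result.or(rows);
            }
            return result;
        }
    }

    // Builds an index where row ordinal i holds values[i].
    static ColumnIndex indexOf(long... values) {
        ColumnIndex index = new ColumnIndex();
        for (int i = 0; i < values.length; i++) {
            index.add(i, values[i]);
        }
        return index;
    }

    // Intersection of two criteria; inputs are left untouched.
    static BitSet and(BitSet left, BitSet right) {
        BitSet result = (BitSet) left.clone();
        result.and(right);
        return result;
    }

    // Union of two criteria; inputs are left untouched.
    static BitSet or(BitSet left, BitSet right) {
        BitSet result = (BitSet) left.clone();
        result.or(right);
        return result;
    }

    public static void main(String[] args) {
        ColumnIndex price = indexOf(10, 20, 30, 20, 40);
        ColumnIndex quantity = indexOf(1, 5, 5, 1, 5);
        BitSet priceHits = price.between(20, 30);
        BitSet quantityHits = quantity.equalTo(5);
        System.out.println(and(priceHits, quantityHits));
        System.out.println(or(priceHits, quantityHits));
    }
}
```

With the sample data in main, the intersection is {1, 2} and the union is 
{1, 2, 3, 4}. Because the set bits are iterated in ascending ordinal order, 
matches come out in row order, and combining several indexed columns costs one 
bitwise operation per criterion.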

This patch changes some of the HBase core so that it can instantiate a region 
class other than the default HRegion. It also fixes some filter bugs.

Would like to add this as a contrib package on the 0.20 branch in time for 
0.20.3 if possible.

