[ https://issues.apache.org/jira/browse/HBASE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796781#action_12796781 ]
stack commented on HBASE-2037:
------------------------------

I committed this. I will be testing the 0.20.3 RC anyway and can test this patch then. I made HBASE-2092 to make a version of this patch for TRUNK.

> Alternate indexed hbase implementation; speeds scans by adding indexes to regions rather secondary tables
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2037
>                 URL: https://issues.apache.org/jira/browse/HBASE-2037
>             Project: Hadoop HBase
>          Issue Type: New Feature
>            Reporter: stack
>             Fix For: 0.20.3
>
>         Attachments: idx-hbase2.patch, idx-hbase3.patch, index.html
>
>
> Purpose
> The goal of the indexed HBase contrib is to speed up scans by indexing HBase
> columns. Indexed HBase (IHBase) is different from the indexed tables in
> transactional HBase (ITHBase): while the indexes in ITHBase are, in fact,
> HBase tables using the indexed column's values as row keys, IHBase creates
> indexes at the region level. The differences are summarized below.
>
> + Global ordering
> ITHBase: yes
> IHBase: no
> Comment: IHBase has an index for each region. The flip side of not having
> global ordering is compatibility with the good old HRegion: results come
> back in row order (and not in value order as in ITHBase).
>
> + Full table scan?
> ITHBase: no
> IHBase: no
> Comment: ITHBase does a partial scan on the index table. IHBase supports
> specifying start/end rows to limit the number of scanned regions.
>
> + Multiple index usage
> ITHBase: no
> IHBase: yes
> Comment: IHBase can take advantage of multiple indexes in the same scan.
> The IHBase IdxScan object accepts an Expression that allows the
> intersection or union of several indexed-column criteria.
>
> + Extra disk storage
> ITHBase: yes
> IHBase: no
> Comment: IHBase indexes are created when the region starts or flushes and
> do not require any extra disk storage.
>
> + Extra RAM
> ITHBase: yes
> IHBase: yes
> Comment: IHBase indexes are held in memory and hence increase the memory
> overhead. ITHBase indexes increase the number of regions each region server
> has to support, which costs memory too.
>
> + Parallel scanning support
> ITHBase: no
> IHBase: yes
> Comment: In ITHBase the index table needs to be consulted and then GETs are
> issued for each matching row. The behavior of IHBase (as perceived by the
> client) is no different from a regular scan and hence supports parallel
> scanning seamlessly. Parallel GETs could be implemented to speed up ITHBase
> scans.
>
> Why IHBase should outperform ITHBase
> 1. More flexible:
>    a. Supports range queries and multi-index queries
>    b. Supports different types, not only byte arrays
> 2. Less overhead: ITHBase pays at least two 'table roundtrips', one for the
>    index table and the other for the main table
> 3. Quicker index expression evaluation: IHBase uses dedicated index data
>    structures while ITHBase uses the regular HRegion scan facilities
>
> Implementation notes
> • Only store files are indexed. Every index scan performs a full memstore
>   scan. Indexing the memstore will be implemented only if scanning the
>   memstore proves to be a performance bottleneck.
> • Index expression evaluation is performed using bit sets. There are two
>   types of bitsets: compressed and expanded. An index will typically store a
>   compressed bitset while an expression evaluator will most probably use an
>   expanded bitset.
>
> + TODO
>
> This patch changes some of the HBase core so that it can instantiate a
> region class other than the default HRegion. It fixes bugs in filter too.
> I would like to add this as a contrib package on the 0.20 branch in time for
> 0.20.3 if possible.
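
For readers who want a concrete picture of the bit-set evaluation mentioned in the implementation notes, here is a minimal sketch. It is not code from the patch: the class RegionIndexSketch and its methods put, eq, and, and or are hypothetical names made up for illustration, and java.util.BitSet stands in for the "expanded" bitset only (the compressed form is not modelled). Each indexed column maps a value to the set of row positions within one region that hold it, and an expression is evaluated by intersecting or uniting those sets.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-region index; not the IHBase API. */
public class RegionIndexSketch {
    // column -> (value -> bitset of row positions in this region with that value)
    private final Map<String, Map<String, BitSet>> index = new HashMap<>();
    private final int rowCount;

    public RegionIndexSketch(int rowCount) {
        this.rowCount = rowCount;
    }

    /** Record that the row at the given position holds value in column. */
    public void put(int rowPosition, String column, String value) {
        index.computeIfAbsent(column, c -> new HashMap<>())
             .computeIfAbsent(value, v -> new BitSet(rowCount))
             .set(rowPosition);
    }

    /** Rows where column == value; returns a copy so callers may mutate it. */
    public BitSet eq(String column, String value) {
        Map<String, BitSet> byValue = index.get(column);
        BitSet hit = (byValue == null) ? null : byValue.get(value);
        return (hit == null) ? new BitSet(rowCount) : (BitSet) hit.clone();
    }

    /** Intersection of two criteria. */
    public static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** Union of two criteria. */
    public static BitSet or(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    public static void main(String[] args) {
        RegionIndexSketch idx = new RegionIndexSketch(4);
        idx.put(0, "color", "red");  idx.put(0, "size", "L");
        idx.put(1, "color", "red");  idx.put(1, "size", "S");
        idx.put(2, "color", "blue"); idx.put(2, "size", "L");
        idx.put(3, "color", "red");  idx.put(3, "size", "L");

        // color == red AND size == L
        System.out.println(and(idx.eq("color", "red"), idx.eq("size", "L"))); // {0, 3}
        // color == red OR size == L
        System.out.println(or(idx.eq("color", "red"), idx.eq("size", "L")));  // {0, 1, 2, 3}
    }
}
```

Because the set bits are visited in ascending position, a scan driven by the resulting bitset hands rows back in row order, which is consistent with the "no global ordering" trade-off described above. A real implementation would also have to scan the memstore, which this sketch ignores.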
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.