[ 
https://issues.apache.org/jira/browse/HBASE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857469#action_12857469
 ] 

Clint Morgan commented on HBASE-2426:
-------------------------------------

Hey George, thanks for the patch.

I have a question about how this improves performance over an
index layout similar to the SimpleIndexKeyGenerator. I have the same
requirements you mention above: namely, I'd like to quickly find all
rows in table A which have a value for COL1 of 'X'.

I build my index keys like <col1-value><sep><base-row-id> where <sep>
is a special byte sequence that does not occur in column values or row
keys. (Actually it can occur; if so, I just escape it in the
index row.) Let's say <sep> is '__' in the example below.

So if I have base rows:
ROW | COL_A
aaa | foo
bbb | bar
ccc | foo
ddd | zoo

Then my index would look like (just the rows are shown):
bar__bbb
foo__aaa
foo__ccc
zoo__ddd

So for the query find all rows where COL_A == foo, I do an index scan
starting at 'foo__' and ending at 'foo_*' (where * is the byte after
'_').
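
To make the scan bounds concrete, here is a minimal sketch of that key
layout and prefix scan. It is not the contrib code: a sorted in-memory
set of strings stands in for the index table, the class and method names
are made up for illustration, and it assumes ASCII values with no
escaping of the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for an index table: a sorted set of row keys.
class IndexKeySketch {
    // Separator from the example above; assumed never to need escaping here.
    static final String SEP = "__";

    // Index row key: <col-value><sep><base-row-id>.
    static String indexKey(String colValue, String baseRow) {
        return colValue + SEP + baseRow;
    }

    // Exclusive stop key: the scan prefix with its last character bumped by
    // one, so "foo__" becomes "foo_`" ('`' is the character after '_').
    static String stopKey(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Range scan over [start, stop): returns the base row ids for colValue.
    static List<String> lookup(TreeSet<String> index, String colValue) {
        String start = colValue + SEP;
        List<String> baseRows = new ArrayList<>();
        for (String key : index.subSet(start, true, stopKey(start), false)) {
            baseRows.add(key.substring(start.length()));
        }
        return baseRows;
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>();
        index.add(indexKey("foo", "aaa"));
        index.add(indexKey("bar", "bbb"));
        index.add(indexKey("foo", "ccc"));
        index.add(indexKey("zoo", "ddd"));
        System.out.println(lookup(index, "foo")); // prints [aaa, ccc]
    }
}
```

With real row keys the same idea applies to byte arrays compared
lexicographically; the strings here are only for readability.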

This will scan through only the two index rows I wanted. It looks
like your patch will make it so that, rather than scanning two rows with
one cell each, I scan one row with two cells. I'm not 100% sure on the
specifics, but I think these two queries would generally be of the
same order of performance.
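
A rough way to picture the two layouts side by side, again with
in-memory sorted collections standing in for index tables (the names are
hypothetical, and this says nothing about real HBase I/O costs):

```java
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the two index layouts for COL_A == "foo".
class IndexLayoutSketch {
    // Layout 1: one index row per base row, key = <value>__<base-row-id>.
    // Scans [value + "__", value + "_`"), where '`' follows '_'.
    static SortedSet<String> flatScan(TreeSet<String> index, String value) {
        return index.subSet(value + "__", value + "_`");
    }

    // Layout 2: one index row per value; base row ids are the qualifiers.
    static SortedSet<String> rowBasedGet(
            TreeMap<String, TreeSet<String>> index, String value) {
        TreeSet<String> qualifiers = index.get(value);
        if (qualifiers == null) {
            return new TreeSet<>();
        }
        return qualifiers;
    }

    public static void main(String[] args) {
        String[][] baseRows = {
            {"aaa", "foo"}, {"bbb", "bar"}, {"ccc", "foo"}, {"ddd", "zoo"}
        };
        TreeSet<String> flat = new TreeSet<>();
        TreeMap<String, TreeSet<String>> rowBased = new TreeMap<>();
        for (String[] r : baseRows) {
            flat.add(r[1] + "__" + r[0]);
            rowBased.computeIfAbsent(r[1], k -> new TreeSet<>()).add(r[0]);
        }
        // Two index rows, one entry each.
        System.out.println(flatScan(flat, "foo"));
        // One index row holding two qualifiers.
        System.out.println(rowBasedGet(rowBased, "foo"));
    }
}
```

Either way the lookup reads the same two entries; what differs is
whether they sit in two adjacent rows or in one row.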

Do I understand things correctly? Is there a reason you could not use
the existing index mechanism for your needs?

I think we could do some work to make this pattern more obvious and
usable with the current infrastructure, but I'm a bit hesitant to add yet
another region/regionserver extension.

George, what do you think?

Slightly aside: When I read about AppEngine's index (a year ago or so), they 
said that they maintain N index rows for a single base row (1 per column being 
indexed). I've been wanting to rework this framework to support that as well, 
but it has not been a high priority as it would require a rewrite of our query 
stuff that uses the current indexing layer. The approach you take is the 
opposite: 1 index row for N base rows. Not sure that really says anything, 
but ...

> [Transactional Contrib] Introduce quick scanning row-based secondary indexes
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-2426
>                 URL: https://issues.apache.org/jira/browse/HBASE-2426
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: contrib
>            Reporter: George P. Stathis
>            Priority: Minor
>             Fix For: 0.20.5, 0.21.0
>
>         Attachments: hbase-2426-0.20-branch.patch
>
>
> RowBasedIndexSpecification is a specialized IndexSpecification class for 
> creating row-based secondary index tables. Base table rows with the same 
> indexed column value have their row keys stored as column qualifiers on the 
> same secondary index table row. The key for that row is the indexed column 
> value from the base table. This allows to avoid expensive secondary index 
> table scans and provides faster access for applications such as foreign key 
> indexing or queries such as "find all table A rows whose familyA:columnB 
> value is X". RowBasedIndexSpecification indices can be scanned using the API 
> on RowBasedIndexedTable. The metadata for RowBasedIndexSpecification differ 
> from IndexSpecification in that:
> - Only a single base table column can be indexed per 
> RowBasedIndexSpecification. No additional columns are put in the index table.
> and 
> - RowBasedIndexKeyGenerator, which constructs the index-row-key from the 
> indexed column value in the original column, is always used.
> For a simple RowBasedIndexSpecification example, look at the 
> TestRowBasedIndexedTable unit test in 
> org.apache.hadoop.hbase.client.tableIndexed.
> To enable RowBasedIndexSpecification indexing, modify hbase-site.xml to turn 
> on the
> IndexedRegionServer.  This is done by setting
> - hbase.regionserver.class to 
> org.apache.hadoop.hbase.ipc.IndexedRegionInterface and
> - hbase.regionserver.impl to 
> org.apache.hadoop.hbase.regionserver.tableindexed.RowBasedIndexedRegionServer

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
