Guys, 

Sorry to be a Debbie Downer here, but this really is not a good idea. Here's 
why:

In terms of design, you have some serious scalability and performance issues 
when compared to alternatives. 


Let me try to give you a real-life example.*

CCCIS (CCC Information Services) is the middleman in the US between the auto 
repair shop and the insurance company. They have one competitor, but they handle 
most of the accident claims in the US.
When you go to your authorized repair shop, the shop uses an application called 
Pathways to record your information, the details of the accident, and the parts 
that need to be replaced. It sends all of that first to CCC, which then sends it 
on to your insurance company. In short, as the middleman CCC collects a lot of 
very useful information about the types of vehicles, the accidents, and the cost 
of the parts and labor needed to put your car back on the road.

So imagine you have a large data warehouse in HBase of all of the claims. Your 
primary key is going to be a composite of the insurer and the claim_id.  
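
A minimal sketch of such a composite key, in plain Python rather than the HBase 
API. The separator byte, the zero-padding width, and the insurer names are all 
assumptions made up for illustration:

```python
# Hypothetical composite row key: insurer id followed by claim id.
# The separator and the 12-digit padding are illustrative assumptions.
SEP = b"\x00"

def claim_row_key(insurer: str, claim_id: int) -> bytes:
    # Zero-pad the claim id so byte-wise (lexicographic) order matches numeric order.
    return insurer.encode("utf-8") + SEP + f"{claim_id:012d}".encode("ascii")

# Keys sort by insurer first, then by claim id within an insurer.
keys = sorted(claim_row_key(i, c) for i, c in [("ACME", 42), ("BOLT", 1), ("ACME", 7)])
```

Because the insurer leads the key, all claims for one insurer sit in a contiguous 
key range, which is what makes the other attributes candidates for an index.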

But you're also going to want to index on make/model, type of accident, driver 
details, location, VIN…

This lets your actuaries figure out the average cost of a front-end collision 
by make and model, or by state/zip.
Or, by age bracket, who's the better driver? 

Imagine that the claim table will have a column for the claim in its entirety  
as an Avro doc (JSON) along with the important fields broken out separately.  
(For this example the schema isn't that important.) 

So you want to find the average cost of a front-end collision for a Volvo S80 
over the past 3 model years.

Now, you have an index based on manufacturer/model/year. 

Using your index scheme, you now have to query every RS (region server) for the 
row keys in the index.
Then you have to take those results and put them in sort order before you can 
use them.

Note: This isn't too bad if you're doing a simple query against one index. You 
can do the work per RS and then merge the results from all of them.
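
A rough sketch of that fan-out and merge, simulated in plain Python. The region 
server names, the row keys, and the shape of the region-local index are 
hypothetical stand-ins, not the hindex API:

```python
import heapq

# Hypothetical region-local indexes: each region server (RS) indexes only
# the rows it hosts, so no single RS can answer the query on its own.
region_servers = {
    "rs1": {("VOLVO", "S80", 2012): ["ACME|0007", "ACME|0042"]},
    "rs2": {("VOLVO", "S80", 2012): ["BOLT|0003"],
            ("VOLVO", "S80", 2013): ["BOLT|0001"]},
    "rs3": {("FORD", "F150", 2012): ["CRUX|0009"]},
}

def query_all_regions(make, model, years):
    partials = []
    # One request per RS, including servers that turn out to hold no matches.
    for local_index in region_servers.values():
        hits = []
        for year in years:
            hits.extend(local_index.get((make, model, year), []))
        partials.append(sorted(hits))
    # Merge the per-RS lists into a single sorted list of base-table row keys.
    return list(heapq.merge(*partials)), len(partials)

rows, fanout = query_all_regions("VOLVO", "S80", [2011, 2012, 2013])
```

Here `fanout` equals the number of region servers regardless of how many of 
them actually hold a matching claim.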

However… what happens if you have two indexes and your result set is going to 
be the intersection of the indexes?

Or you're going to do a join between two tables using the indexes to limit the 
result set? 

Now your design breaks down quickly. 
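
To make the intersection case concrete, here is a small sketch in plain Python. 
The two lists stand for the row keys returned by two different indexes; the keys 
and what each index covers are invented for illustration. Note that both inputs 
must be complete and sorted before the intersection can run:

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of row keys in a single pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical results gathered from two separate indexes.
by_vehicle = ["ACME|0007", "ACME|0042", "BOLT|0001", "BOLT|0003"]  # make/model/year index
by_accident = ["ACME|0042", "BOLT|0003", "CRUX|0009"]              # accident-type index

both = intersect_sorted(by_vehicle, by_accident)
```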

And then there's another problem. 
Your index may be much smaller than your base table. 
In this example, the insurance claim is a huge record, I would say 2-3 orders 
of magnitude larger than the row key. Since you split your index at the same 
rate as you split your table, you will have a lot of regions for your index, 
each holding very little data.
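
Some back-of-the-envelope arithmetic shows the effect. Every number below 
(claim count, record size, index entry size, region size) is an assumption picked 
only to make the ratio visible, not a measurement:

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

# All figures are hypothetical.
claims = 1_000_000_000          # rows in the base table
claim_bytes = 10_000            # ~10 KB per claim record
index_entry_bytes = 50          # ~50 B per index entry (200x smaller)
region_bytes = 10 * 10**9       # split threshold of ~10 GB per region

base_regions = ceil_div(claims * claim_bytes, region_bytes)

# Index split in lockstep with the base table: one index region per base region.
colocated_index_regions = base_regions
bytes_per_colocated_index_region = claims * index_entry_bytes // colocated_index_regions

# Index kept as its own table and split only by its own size.
standalone_index_regions = ceil_div(claims * index_entry_bytes, region_bytes)
```

Under these assumptions the lockstep index ends up with 1000 regions of about 
50 MB each, while a standalone index table of the same data fits in 5 regions.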

Again, this may lead to other issues…

Is it better than doing a full table scan? Sure. 

Are there better alternatives? 
Yes. 
Apply KISS. (Keep it simple) 

Still using an inverted table, let HBase manage it on its own rather than trying 
to tie it to the underlying base table. 
While it's not perfect, it's lighter and will perform better in the general use 
cases. (You could even use Async HBase to decouple the write to the base table 
from the update to the index.) 
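
A toy model of that inverted table, again in plain Python rather than the HBase 
client. The class name, the key layout (indexed value, separator, base row key), 
and the sample data are assumptions for illustration:

```python
import bisect

class InvertedIndexTable:
    """Toy stand-in for a separate index table with its own sorted key space."""

    def __init__(self):
        self._keys = []  # sorted index row keys: "<indexed value>\x00<base row key>"

    def put(self, indexed_value: str, base_row_key: str) -> None:
        key = f"{indexed_value}\x00{base_row_key}"
        pos = bisect.bisect_left(self._keys, key)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)

    def prefix_scan(self, indexed_value: str) -> list:
        # All entries for one indexed value are contiguous, so a single
        # range scan returns the matching base-table row keys in order.
        prefix = indexed_value + "\x00"
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            out.append(key[len(prefix):])
        return out

idx = InvertedIndexTable()
idx.put("VOLVO/S80/2012", "BOLT|0003")
idx.put("VOLVO/S80/2012", "ACME|0042")
idx.put("VOLVO/S80/2013", "BOLT|0001")
idx.put("FORD/F150/2012", "CRUX|0009")
matches = idx.prefix_scan("VOLVO/S80/2012")
```

The lookup touches one contiguous key range of the index table instead of asking 
every region server, and the result already comes back sorted by base row key.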

Same model could be applied to a Lucene index as well.

Just saying… 

-Mike


*FULL DISCLOSURE
I am a consultant and CCC was a client of mine back in the late '90s.  In one 
project I worked on ProEFT (now defunct) and an ODS, also now defunct.  The 
example is a hypothetical of what I would do if I were CCC and wanted to use 
Big Data to help manage Auto claims. Any resemblance to any actual work being 
done by CCC in the Big Data space is pure coincidence. ;-)

On Aug 13, 2013, at 1:31 PM, Andrew Purtell <[email protected]> wrote:

> Thanks so much for the contribution!
> 
> On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <
> [email protected]> wrote:
> 
>> Hi,
>> 
>> We have been working on implementing secondary index in HBase, and had
>> shared an overview of our design in the 2012  Hadoop Technical Conference
>> at Beijing(http://bit.ly/hbtc12-hindex). We are pleased to open source it
>> today.
>> 
>> The project is available on github.
>>  https://github.com/Huawei-Hadoop/hindex
>> 
>> It is 100% Java, compatible with Apache HBase 0.94.8, and is open sourced
>> under Apache Software License v2.
>> 
>> Following features are supported currently.
>> -          multiple indexes on table,
>> -          multi column index,
>> -          index based on part of a column value,
>> -          equals and range condition scans using index, and
>> -          bulk loading data to indexed table (Indexing done with bulk
>> load)
>> 
>> We now plan to raise HBase JIRA(s) to make it available in Apache release,
>> and can hopefully continue our work on this in the community.
>> 
>> Regards
>> Rajeshbabu
>> 
>> 
> 
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
