Hi, I am currently evaluating HBase for an implementation of an ERP-like cloud solution that is supposed to handle 500M lines per year for the biggest tenant and 10-20M for the smaller tenants. I am writing a couple of prototypes, one using sharded MySQL and one using HBase; I will let you know what I find if you are interested. Anyway, I have two questions:
The first one is about the following post; I would like to get a perspective from the NoSQL camp on it:
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc

The second is about how to best implement a 'duplicate check' validation. Here is what I have done so far. I have a single entity table, and I have created an index table whose row key is the concatenated value of 4 of the entity's attributes (these 4 attributes define what constitutes a duplicate record, while the entity has around 100-150 attributes in total). In the index table, I have a column in which I store a comma-delimited list of the keys of all entities that share the same values for those 4 attributes.

For example, assuming that a dup is defined by entities having the same values for a and b:

EntityTable:
key, a, b, c, d, e
1,   1, 1, 1, 1, 1
2,   1, 1, 2, 2, 2
3,   1, 2, 2, 2, 2
4,   2, 2, 2, 2, 2

IndexTable:
key, value
11,  [1, 2]
12,  [3]
22,  [4]

When I scan through my entity table, I plan to look up the index table by the dup key and add the current entity's key to the list. I am worried about the performance cost of doing this lookup for every entity record. To make things more complicated, the set of attributes that defines a dup can change; I handle that by recreating the index table.

Is there a better way to write a dup check? A rough sketch of my current per-record lookup is below.
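To be concrete, here is a minimal sketch of the per-record read-modify-write I have in mind, against the plain Java client API. The table name, column family, and the '|' separator in the composite key are placeholders I picked (in my example above I simply concatenated the raw values, which is ambiguous once values can be multi-digit):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DupIndex {

    private static final byte[] FAMILY    = Bytes.toBytes("f");    // placeholder column family
    private static final byte[] QUALIFIER = Bytes.toBytes("keys"); // holds the comma-delimited key list

    // Build the index row key from the dup-defining attribute values.
    // A separator prevents e.g. (a=1, b=12) and (a=11, b=2) from colliding.
    static byte[] dupKey(String... values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            if (sb.length() > 0) sb.append('|');
            sb.append(v);
        }
        return Bytes.toBytes(sb.toString());
    }

    // For one entity row: fetch the index row, append the entity key, write it back.
    // This is one Get plus one Put per entity record, which is what worries me.
    static void recordEntity(HTable indexTable, byte[] dupKey, String entityKey)
            throws IOException {
        Result existing = indexTable.get(new Get(dupKey));
        byte[] current = existing.getValue(FAMILY, QUALIFIER);
        String updated = (current == null)
                ? entityKey
                : Bytes.toString(current) + "," + entityKey;
        Put put = new Put(dupKey);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(updated));
        indexTable.put(put);
    }

    // Usage, assuming the tables already exist:
    //   HTable index = new HTable(HBaseConfiguration.create(), "IndexTable");
    //   recordEntity(index, dupKey("1", "1"), "2");
}

One variant I have been wondering about: instead of keeping a comma-delimited list in a single cell, store each entity key as its own column qualifier in the index row. The update then becomes a blind, idempotent Put with no Get at all, and avoids the race the sketch above has when two clients update the same index row concurrently. Would that be the more idiomatic HBase approach?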
Thanks a lot for your help,
-Eric