Hi, I am currently evaluating HBase for an implementation of an ERP-like cloud solution that is supposed to handle 500M lines per year for the biggest tenant and 10-20M for the smaller tenants. I am writing a couple of prototypes, one using sharded MySQL and one using HBase; I will let you know what I find if you are interested. Anyway, I have two questions:
The first one is about the following post; I would like to get a perspective from the NoSQL camp on it:
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc

The second is about how to best implement a 'duplicate check' validation. Here is what I have done so far. I have a single entity table, and I have created an index table whose row key is the concatenated value of 4 of the entity's attributes (these 4 attributes define what constitutes a duplicate record, while the entity has around 100-150 attributes in total). In the index table, I have a column in which I store a comma-delimited list of the keys of all entities that share the same values for those 4 attributes.

For example, assuming that a dup is defined by entities having the same values for a and b:

EntityTable:
key, a, b, c, d, e
1,   1, 1, 1, 1, 1
2,   1, 1, 2, 2, 2
3,   1, 2, 2, 2, 2
4,   2, 2, 2, 2, 2

IndexTable:
key, value
11,  [1, 2]
12,  [3]
22,  [4]

When I scan through my entity table, I plan to look up the index table by the dup key and add the current entity's key to the list. I am worried about the performance cost of doing this lookup for every entity record. To make things more complicated, the set of attributes that defines a dup can change; I handle that by recreating the index table.

Is there a better way to write a dup check? A rough sketch of my current per-record lookup is below.
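To be concrete, here is a minimal sketch of the per-record read-modify-write I have in mind, against the plain Java client API. The table name, column family, and the '|' separator in the composite key are placeholders I picked (in my example above I simply concatenated the raw values, which is ambiguous once values can be multi-digit):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DupIndex {

    private static final byte[] FAMILY    = Bytes.toBytes("f");    // placeholder column family
    private static final byte[] QUALIFIER = Bytes.toBytes("keys"); // holds the comma-delimited key list

    // Build the index row key from the dup-defining attribute values.
    // A separator prevents e.g. (a=1, b=12) and (a=11, b=2) from colliding.
    static byte[] dupKey(String... values) {
        StringBuilder sb = new StringBuilder();
        for (String v : values) {
            if (sb.length() > 0) sb.append('|');
            sb.append(v);
        }
        return Bytes.toBytes(sb.toString());
    }

    // For one entity row: fetch the index row, append the entity key, write it back.
    // This is one Get plus one Put per entity record, which is what worries me.
    static void recordEntity(HTable indexTable, byte[] dupKey, String entityKey)
            throws IOException {
        Result existing = indexTable.get(new Get(dupKey));
        byte[] current = existing.getValue(FAMILY, QUALIFIER);
        String updated = (current == null)
                ? entityKey
                : Bytes.toString(current) + "," + entityKey;
        Put put = new Put(dupKey);
        put.add(FAMILY, QUALIFIER, Bytes.toBytes(updated));
        indexTable.put(put);
    }

    // Usage, assuming the tables already exist:
    //   HTable index = new HTable(HBaseConfiguration.create(), "IndexTable");
    //   recordEntity(index, dupKey("1", "1"), "2");
}

One variant I have been wondering about: instead of keeping a comma-delimited list in a single cell, store each entity key as its own column qualifier in the index row. The update then becomes a blind, idempotent Put with no Get at all, and avoids the race the sketch above has when two clients update the same index row concurrently. Would that be the more idiomatic HBase approach?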
Thanks a lot for your help,
-Eric