Hi Eric,
Nutch addresses the duplicate record problem by computing a signature for
each document; the signature is what it compares to decide whether the
content is a duplicate. Your design is also good.

However, there is one possible issue with it: after you read an index key
and its list of matching entity keys, e.g. [a, b, c, d, e], you need a
random read per key to fetch the details, and that will be slow.
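
Roughly what I mean by a signature - just a minimal sketch (the attribute
names and the separator are placeholders, not from your schema): hash the
attributes that define a duplicate and use the digest as the row key of the
index table, so the dup check becomes a single Get on a fixed-length,
well-distributed key.

import java.security.MessageDigest;

// Nutch-style signature sketch: two entities with the same values for the
// attributes that define a duplicate produce the same 16-byte row key.
public class DupSignature {
    public static byte[] signature(String a, String b, String c, String d)
            throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // The separator avoids collisions like ("ab","c") vs ("a","bc").
        md5.update((a + "\u0000" + b + "\u0000" + c + "\u0000" + d)
                .getBytes("UTF-8"));
        return md5.digest();
    }
}

If you also keep, in that same index row, the few detail columns you
actually need, you may be able to skip most of the extra random reads.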

MySQL (sharded) vs. HBase - please do share your findings.

Cheers
Abinash Karan

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Monday, January 17, 2011 5:52 PM
To: [email protected]
Subject: evaluating HBase

Hi,

I am currently evaluating HBase for an implementation of an ERP-like cloud
solution that is supposed to handle 500M lines per year for the biggest
tenant and 10-20M for the smaller tenants.  I am writing a couple of
prototypes, one using MySQL (sharded) and one using HBase - I will let you
know what I find if you are interested.  Anyway, I have two questions:

The first one is regarding the following post; I would like to get a
perspective from the NoSQL camp on it:
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc


The second is regarding how best to implement a 'duplicate check'
validation.  Here is what I have done so far: I have a single entity table,
and I have created an index table whose key is the concatenated value of
the 4 attributes of the entity (these 4 attributes define what constitutes
a duplicate record, while the entity can have around 100-150 different
attributes).  In this index table, I have a column in which I store a
comma-delimited list of all the keys that correspond to entities that share
the same set of 4 attribute values.

For example (assuming that a dup is defined by entities having the same
values of a and b):

EntityTable:
key, a, b, c, d, e
1, 1, 1, 1, 1, 1
2, 1, 1, 2, 2, 2
3, 1, 2, 2, 2, 2
4, 2, 2, 2, 2, 2

IndexTable:
key, value
11, [1, 2]
12, [3]
22, [4]

When I scan through my entity table, I plan on looking up the index table
by the dup key and adding the current entity key to it (a rough sketch of
that loop is below).  I am worried about the performance of doing this
lookup for every entity record.  To make things more complicated, I should
be able to change the set of attributes that define a dup; I handle that by
recreating my index table.
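
Here is roughly what I mean, as a minimal sketch against the 0.90 Java
client - the table names come from the example above, but the column
family 'f' and the column names are placeholders for whatever my real
schema uses:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DupIndexBuilder {
    static final byte[] CF   = Bytes.toBytes("f");    // placeholder column family
    static final byte[] KEYS = Bytes.toBytes("keys"); // comma-delimited entity keys
    static final byte[] A    = Bytes.toBytes("a");    // attributes that define a dup
    static final byte[] B    = Bytes.toBytes("b");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable entities = new HTable(conf, "EntityTable");
        HTable index    = new HTable(conf, "IndexTable");

        Scan scan = new Scan();
        scan.addColumn(CF, A);
        scan.addColumn(CF, B);
        ResultScanner scanner = entities.getScanner(scan);
        for (Result r : scanner) {
            // Dup key = concatenation of the attributes that define a duplicate.
            String dupKey = Bytes.toString(r.getValue(CF, A))
                          + Bytes.toString(r.getValue(CF, B));
            String entityKey = Bytes.toString(r.getRow());

            // One Get per entity row - this is the lookup I am worried about.
            Result existing = index.get(new Get(Bytes.toBytes(dupKey)));
            String keys = existing.isEmpty()
                    ? entityKey
                    : Bytes.toString(existing.getValue(CF, KEYS)) + "," + entityKey;

            Put put = new Put(Bytes.toBytes(dupKey));
            put.add(CF, KEYS, Bytes.toBytes(keys));
            index.put(put);
        }
        scanner.close();
        entities.close();
        index.close();
    }
}

The Get inside the loop is exactly the per-record lookup I am worried
about.  I could at least batch the Puts with the client write buffer, and
if several writers can touch the same index row I would presumably need
something like checkAndPut to avoid losing updates.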

Is there a better way to write a dup check?

Thanks a lot for your help,
-Eric
