indexing question

Ishaaq Chandy Fri, 03 Jul 2009 00:11:43 -0700

Hi all,
I am pretty new to HBase so forgive me if this seems like a silly question.


Each row in my HBase table is a geographical location that is related to
other locations. For example, one relationship is the CONTAIN relationship:
Europe CONTAINs England, France, Spain, etc. There is an inverse
relationship as well, called PARENT, so England has a PARENT called Europe.
However, note that, for various business reasons not pertinent to this
discussion, the inverse relationship need not always be set, i.e. we may not
store France with a PARENT value of Europe, even though Europe CONTAINs
France.

So, I store each location as a row with an id and the payload data for that
location as a separate data column. This data column includes the sets of
ids of the related locations.

Now, I want to be able to update/delete locations consistently. So, in my
example above, I might want to delete France, in which case I also want to
make sure that I delete the CONTAINs relationship that Europe has with
France as that is now obsolete. What is the most efficient way to do this? I
want to minimise the number of writes I would have to do - on the other hand
optimising read performance is more important as writes do not happen that
often (this is geographic data after all).

My thoughts are: I will have to do 1+n writes to do a delete, i.e. 1 write
operation to delete France and n write operations to delete the
relationships that n other locations may have to France. In the case of a
root location like Europe, which may have a large number of locations that
relate to it, this may be expensive, but I see no other way.
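
To make the 1+n cost concrete, here is a minimal in-memory sketch in Python.
It is not HBase client code: the dict stands in for the table, and the names
`table` and `delete_location` are purely illustrative assumptions. It only
counts how many rows would have to be rewritten when a location is deleted.

```python
# In-memory stand-in for the table: row id -> payload holding related ids.
# All names here are illustrative; none of this is an HBase API.
table = {
    "Europe": {"CONTAIN": {"England", "France", "Spain"}, "PARENT": set()},
    "England": {"CONTAIN": set(), "PARENT": {"Europe"}},
    "France": {"CONTAIN": set(), "PARENT": set()},  # inverse link deliberately unset
    "Spain": {"CONTAIN": set(), "PARENT": {"Europe"}},
}


def delete_location(table, target):
    """Delete `target` and every relationship pointing at it.

    Returns the number of row writes performed (1 for the delete itself,
    plus 1 for each other row that referenced `target`).
    """
    del table[target]
    writes = 1
    for payload in table.values():
        touched = False
        for related_ids in payload.values():
            if target in related_ids:
                related_ids.discard(target)
                touched = True
        if touched:
            writes += 1
    return writes


writes = delete_location(table, "France")
```

Here only Europe refers to France, so n is 1. Note that the loop visits every
row to find the referrers, which is exactly the part an index would need to
avoid.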

So, I was wondering: how do I index this to speed it up as far as
possible? Given the location Europe, what fields should I include in its
row, and how should I index them? I could create a column family for
each relationship type with a label, the label being the id of the location
this location is related to. For example, the Europe row would have a
column called CONTAIN:England (assuming "England" is the id for the England
location - in reality it would be a UUID). I would then have as many labels
under the CONTAIN family for Europe as locations that Europe contains.
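
The layout described above can be sketched as nested dicts (row id, then
column family, then label). This is only a model of the data shape, not
HBase client code; the helper names `put_relationship` and `column_names`
are hypothetical, and storing an empty cell value is an assumption, since
only the label carries information here.

```python
# Model of the proposed layout: row id -> column family -> label -> cell value.
# Helper names are hypothetical; empty cell values are an assumption.


def put_relationship(rows, row_id, family, related_id):
    """Record that `row_id` has relationship `family` to `related_id`."""
    rows.setdefault(row_id, {}).setdefault(family, {})[related_id] = b""


def column_names(rows, row_id):
    """List a row's columns in family:label form, sorted."""
    return sorted(
        f"{family}:{label}"
        for family, labels in rows[row_id].items()
        for label in labels
    )


rows = {}
for child in ("England", "France", "Spain"):
    put_relationship(rows, "Europe", "CONTAIN", child)
put_relationship(rows, "England", "PARENT", "Europe")

europe_columns = column_names(rows, "Europe")
```

With this shape, checking whether Europe CONTAINs a given location is a
single lookup within the Europe row.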

How would I index this and ensure that, when deleting France, the query "list
all locations that CONTAIN France" returns Europe (and whatever else)
as quickly as possible?
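
For reference, this is what the query amounts to without any index, again as
an in-memory Python sketch rather than HBase client code. The function name
`containers_of` and the second container "EU" are hypothetical, added only
so that the result has more than one entry.

```python
# Same nested-dict model: row id -> column family -> label -> cell value.
# "EU" is a hypothetical second location that also CONTAINs France.
rows = {
    "Europe": {"CONTAIN": {"England": b"", "France": b""}},
    "EU": {"CONTAIN": {"France": b""}},
    "England": {"PARENT": {"Europe": b""}},
    "France": {},
}


def containers_of(rows, target):
    """Return ids of all rows that have a CONTAIN:<target> column.

    Without an index this has to look at every row in the table.
    """
    return sorted(
        row_id
        for row_id, families in rows.items()
        if target in families.get("CONTAIN", {})
    )


result = containers_of(rows, "France")
```

The per-row check is cheap, but the scan over all rows is the cost I would
like to avoid.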

Thanks,
Ishaaq
