Hi,

Disclaimer: I am a newbie, so this is just one option, and I am basing it on my understanding that secondary indexes are not yet working in HBase...
So since HBase has very fast "get by primary key" but is *still* (?) without working secondary indexes, you would need to do scans to find the records.

A workaround would be to have two more tables, "Country_Contains" and "Country_Contained_In". In each table the primary key is the unique ID of the country, and the payload is the set of keys of the related rows in the main table. Basically this creates two tables that act as the index manually. It duplicates data, and it means keeping all three tables consistent whenever you do CRUD, but it lets you find the rows to modify without scanning.

Just one idea... (I've put a rough sketch of what the index maintenance might look like below Ishaaq's quoted mail.)

Cheers,
Tim

On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<[email protected]> wrote:
> Hi all,
> I am pretty new to HBase so forgive me if this seems like a silly question.
>
> Each row in my HBase table is a geographical location that is related to
> other locations. For example, one relationship is the CONTAIN relationship. So,
> Europe CONTAINs England, France, Spain etc. There is an inverse
> relationship as well, called PARENT, so England has a PARENT called Europe.
> However, note that, for various business reasons not pertinent to this
> discussion, the inverse relationship need not always be set, i.e. we may not
> store France with a PARENT value of Europe, even though Europe CONTAINs
> France.
>
> So, I store each location as a row with an id and the payload data for that
> location as a separate data column. This data column includes the sets of
> ids of the related locations.
>
> Now, I want to be able to update/delete locations consistently. So, in my
> example above, I might want to delete France, in which case I also want to
> make sure that I delete the CONTAINs relationship that Europe has with
> France, as that is now obsolete. What is the most efficient way to do this? I
> want to minimise the number of writes I would have to do - on the other hand,
> optimising read performance is more important, as writes do not happen that
> often (this is geographic data, after all).
>
> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1 write
> operation to delete France and n write operations to delete the
> relationships that n other locations may have to France. In the case of a
> root location like Europe, which may have a large number of locations that
> relate to it, this may be expensive, but I see no other way.
>
> So, I was wondering, how do I index this to speed this up as far as
> possible? Given the location Europe, what are the fields I should
> include in its row and how do I index them? I could create a column family
> for each relationship type, with a label - the label being the id of the
> location this location is related to. So, for example, the Europe row would
> have a column called CONTAIN:England (assuming "England" is the id for the
> England row - in reality it would be a UUID). I would then have as many
> labels under the CONTAIN family for Europe as locations that Europe contains.
>
> How would I index this and ensure that when deleting France the query "list
> all locations that CONTAIN France" returns Europe (and whatever else)
> as quickly as possible?
>
> Thanks,
> Ishaaq
>
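
P.S. Here is the rough sketch I mentioned, just to make the two-index-table idea concrete. I'm using the current HBase Java client API (the API in the 0.19/0.20 releases looks a bit different), and the main table name "locations", the column family "rel" and the method names are made up for illustration. Note there is no real cross-table transaction here - the writes to the three tables are independent, so this only sketches the happy path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ContainsIndex {
  // One column per related row key; the family name "rel" is made up.
  private static final byte[] REL = Bytes.toBytes("rel");

  // Record "parent CONTAINs child" in both index tables so the relationship
  // can be looked up from either side with a single get, no scan.
  static void addContains(Connection conn, String parentId, String childId)
      throws IOException {
    try (Table contains = conn.getTable(TableName.valueOf("Country_Contains"));
         Table containedIn = conn.getTable(TableName.valueOf("Country_Contained_In"))) {
      contains.put(new Put(Bytes.toBytes(parentId))
          .addColumn(REL, Bytes.toBytes(childId), Bytes.toBytes("")));
      containedIn.put(new Put(Bytes.toBytes(childId))
          .addColumn(REL, Bytes.toBytes(parentId), Bytes.toBytes("")));
    }
  }

  // Delete a location: the reverse index tells us every row that CONTAINs it,
  // so we can remove those references and then drop the row itself.
  static void deleteLocation(Connection conn, String id) throws IOException {
    try (Table locations = conn.getTable(TableName.valueOf("locations"));
         Table contains = conn.getTable(TableName.valueOf("Country_Contains"));
         Table containedIn = conn.getTable(TableName.valueOf("Country_Contained_In"))) {
      byte[] row = Bytes.toBytes(id);

      // 1. One get against the reverse index instead of a full-table scan.
      Result parents = containedIn.get(new Get(row).addFamily(REL));
      if (!parents.isEmpty()) {
        for (byte[] parentId : parents.getFamilyMap(REL).keySet()) {
          // 2. Remove the CONTAIN entry each parent holds for this row
          //    (the parent's payload in the main table would be updated too).
          contains.delete(new Delete(parentId).addColumns(REL, row));
        }
      }

      // 3. Drop the location and both of its index rows.
      locations.delete(new Delete(row));
      contains.delete(new Delete(row));
      containedIn.delete(new Delete(row));
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      addContains(conn, "Europe", "France");
      deleteLocation(conn, "France"); // also removes Europe's CONTAIN:France entry
    }
  }
}

So deleting France costs 1 get plus the 1+n deletes Ishaaq described, but no scan is needed to find the n parents.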
