Hi,

Disclaimer: I am a newbie, so this is just one option, and I am basing it on my understanding that secondary indexes are not yet working in HBase...
So since HBase has very fast "get by primary key" but is *still* (?) without working secondary indexes, you would need to do scans to find the records.

A workaround would be to have two more tables, "Country_Contains" and "Country_Contained_In". In each table the primary key is the unique ID of the country, and the payload is the set of keys of the related rows in the main table. Basically this creates two tables that act as the index manually. It duplicates data, and it means keeping all three tables consistent whenever you do CRUD, but it lets you find the rows to modify without scanning.

Just one idea... (I've put a rough sketch of what the index maintenance might look like below Ishaaq's quoted mail.)

Cheers,
Tim

On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<[email protected]> wrote:
> Hi all,
> I am pretty new to HBase so forgive me if this seems like a silly question.
>
> Each row in my HBase table is a geographical location that is related to
> other locations. For example, one relationship is the CONTAIN relationship. So,
> Europe CONTAINs England, France, Spain etc. There is an inverse
> relationship as well, called PARENT, so England has a PARENT called Europe.
> However, note that, for various business reasons not pertinent to this
> discussion, the inverse relationship need not always be set, i.e. we may not
> store France with a PARENT value of Europe, even though Europe CONTAINs
> France.
>
> So, I store each location as a row with an id and the payload data for that
> location as a separate data column. This data column includes the sets of
> ids of the related locations.
>
> Now, I want to be able to update/delete locations consistently. So, in my
> example above, I might want to delete France, in which case I also want to
> make sure that I delete the CONTAINs relationship that Europe has with
> France, as that is now obsolete. What is the most efficient way to do this? I
> want to minimise the number of writes I would have to do - on the other hand,
> optimising read performance is more important, as writes do not happen that
> often (this is geographic data, after all).
>
> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1 write
> operation to delete France and n write operations to delete the
> relationships that n other locations may have to France. In the case of a
> root location like Europe, which may have a large number of locations that
> relate to it, this may be expensive, but I see no other way.
>
> So, I was wondering, how do I index this to speed this up as far as
> possible? Given the location Europe, what are the fields I should
> include in its row and how do I index them? I could create a column family
> for each relationship type, with a label - the label being the id of the
> location this location is related to. So, for example, the Europe row would
> have a column called CONTAIN:England (assuming "England" is the id for the
> England row - in reality it would be a UUID). I would then have as many
> labels under the CONTAIN family for Europe as locations that Europe contains.
>
> How would I index this and ensure that when deleting France the query "list
> all locations that CONTAIN France" returns Europe (and whatever else)
> as quickly as possible?
>
> Thanks,
> Ishaaq
>
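
P.S. Here is the rough sketch I mentioned, just to make the two-index-table idea concrete. I'm using the current HBase Java client API (the API in the 0.19/0.20 releases looks a bit different), and the main table name "locations", the column family "rel" and the method names are made up for illustration. Note there is no real cross-table transaction here - the writes to the three tables are independent, so this only sketches the happy path:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ContainsIndex {
  // One column per related row key; the family name "rel" is made up.
  private static final byte[] REL = Bytes.toBytes("rel");

  // Record "parent CONTAINs child" in both index tables so the relationship
  // can be looked up from either side with a single get, no scan.
  static void addContains(Connection conn, String parentId, String childId)
      throws IOException {
    try (Table contains = conn.getTable(TableName.valueOf("Country_Contains"));
         Table containedIn = conn.getTable(TableName.valueOf("Country_Contained_In"))) {
      contains.put(new Put(Bytes.toBytes(parentId))
          .addColumn(REL, Bytes.toBytes(childId), Bytes.toBytes("")));
      containedIn.put(new Put(Bytes.toBytes(childId))
          .addColumn(REL, Bytes.toBytes(parentId), Bytes.toBytes("")));
    }
  }

  // Delete a location: the reverse index tells us every row that CONTAINs it,
  // so we can remove those references and then drop the row itself.
  static void deleteLocation(Connection conn, String id) throws IOException {
    try (Table locations = conn.getTable(TableName.valueOf("locations"));
         Table contains = conn.getTable(TableName.valueOf("Country_Contains"));
         Table containedIn = conn.getTable(TableName.valueOf("Country_Contained_In"))) {
      byte[] row = Bytes.toBytes(id);

      // 1. One get against the reverse index instead of a full-table scan.
      Result parents = containedIn.get(new Get(row).addFamily(REL));
      if (!parents.isEmpty()) {
        for (byte[] parentId : parents.getFamilyMap(REL).keySet()) {
          // 2. Remove the CONTAIN entry each parent holds for this row
          //    (the parent's payload in the main table would be updated too).
          contains.delete(new Delete(parentId).addColumns(REL, row));
        }
      }

      // 3. Drop the location and both of its index rows.
      locations.delete(new Delete(row));
      contains.delete(new Delete(row));
      containedIn.delete(new Delete(row));
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      addContains(conn, "Europe", "France");
      deleteLocation(conn, "France"); // also removes Europe's CONTAIN:France entry
    }
  }
}

So deleting France costs 1 get plus the 1+n deletes Ishaaq described, but no scan is needed to find the n parents.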
