Those 2 tables could be collapsed into 1 table with 2 columns of course...
On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<[email protected]> wrote: > Hi, > > Disclaimer: I am a newbie, so this is just one option, and I am basing > on my understanding that secondary indexes are not yet working on > HBase... > > So since HBase has very fast "get by primary key", but is *still* (?) > without working secondary indexes, you would need to do scans to find > the records. A workaround would be to have 2 more tables > "Country_Contains" and "Country_Contained_In", and in each table, the > primary key is the unique ID of the country, the payload being the > Keys to the rows in the main table. Basically this is creating 2 > tables to act as the index manually. This is a duplication of data, > and would require management of 3 tables wrapped in a transaction when > doing CRUD, but it would allow for lookup of the rows to modify > without need for scanning. > > Just one idea... > > Cheers, > > Tim > > > > > On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<[email protected]> wrote: >> Hi all, >> I am pretty new to HBase so forgive me if this seems like a silly question. >> >> Each row in my Hbase table is a geographical location that is related to >> other locations. For e.g. one relationship is the CONTAIN relationship. So, >> Europe CONTAINs England, France, Spain etc. There is an inverse >> relationship as well called PARENT, so England has a PARENT called Europe. >> However, note that, for various business reasons not pertinant to this >> discussion, the inverse relationship need not always be set, i.e. we may not >> store France with a PARENT value of Europe, even though Europe CONTAINs >> France. >> >> So, I store each location as a row with an id and the payload data for that >> location as a separate data column. This data column includes the sets of >> ids of the related locations. >> >> Now, I want to be able to update/delete locations consistently. So, in my >> example above, I might want to delete France, in which case I also want to >> make sure that I delete the CONTAINs relationship that Europe has with >> France as that is now obsolete. What is the most efficient way to do this? I >> want to minimise the number of writes I would have to do - on the other hand >> optimising read performance is more important as writes do not happen that >> often (this is geographic data after all). >> >> My thoughts are: I will have to do 1+n writes to do a delete - i.e. 1 write >> operation to delete France and n write operations to delete the >> relationships that n other locations may have to France. In the case of a >> root location like Europe that may have a large number of locations that >> relate to it this may be expensive, but I see no other way. >> >> So, I was wondering, how do I index this to speed this up as far as >> possible. So, given the location Europe, what are the fields I should >> include in its row and how to index them? I could create a column family for >> each relationship type with a label - the label being the id of the location >> this location is related to, so, for e.g., the Europe row would have a >> column called CONTAIN:England (assuming "England" is the id for the England >> column - in reality it would be a UUID). I would then have as many labels >> under the CONTAIN family for Europe as locations that Europe contains. >> >> How would I index this and ensure that when deleting France the query: "list >> all locations that CONTAIN France" returns with Europe (and whatever else) >> as quickly as possible? >> >> Thanks, >> Ishaaq >> >
