Hey, HBase doesn't provide declarative secondary indexes. Your app code needs to maintain them itself, keeping the index in sync with dual writes into two tables. You don't have to duplicate the data: the secondary index can just hold pointers into the main table, but then reads have to chase those pointers, potentially costing thousands of extra RPCs. There are no HBase transactions when you are modifying multiple tables, but that isn't as big a problem as it seems.
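To make the dual-write pattern concrete, here is a rough sketch using the Java client API (the newer Connection/Table API rather than the HTable API that was current when this thread was written). The table names "locations" and "locations_by_name", the column family "d", the qualifiers, and the row keys are all invented for the example; the point is just that the application issues both puts itself and then chases the pointer back on reads:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWriteExample {
  private static final byte[] FAM = Bytes.toBytes("d");

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table main = conn.getTable(TableName.valueOf("locations"));
         Table index = conn.getTable(TableName.valueOf("locations_by_name"))) {

      String rowKey = "uuid-france";        // primary key in the main table (hypothetical)
      String secondaryKey = "France";       // the value we want to be able to look up by

      // 1) Write the real data into the main table.
      Put data = new Put(Bytes.toBytes(rowKey));
      data.addColumn(FAM, Bytes.toBytes("name"), Bytes.toBytes("France"));
      main.put(data);

      // 2) Write a pointer row into the index table: the index row key is the
      //    secondary key, and its value is just the main-table row key.
      Put pointer = new Put(Bytes.toBytes(secondaryKey));
      pointer.addColumn(FAM, Bytes.toBytes("pk"), Bytes.toBytes(rowKey));
      index.put(pointer);

      // Reading "by name" is then two RPCs: one Get on the index to find the
      // pointer, and one Get on the main table to fetch the actual row.
      Result ptr = index.get(new Get(Bytes.toBytes(secondaryKey)));
      byte[] mainKey = ptr.getValue(FAM, Bytes.toBytes("pk"));
      Result row = main.get(new Get(mainKey));
      System.out.println(Bytes.toString(row.getValue(FAM, Bytes.toBytes("name"))));
    }
  }
}

The two puts are separate RPCs with nothing tying them together, and every lookup by the secondary key pays an extra round trip to follow the pointer.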
If all this sounds very 'raw' and 'painful' and 'error prone', let me remind you what HBase is for, and perhaps you can make a better choice. HBase is for when you hit the limits of what you can do with MySQL. When you work to scale MySQL you end up giving things up, step by step:
- no transactions
- no secondary indexes (they are slow on MySQL/InnoDB)
- indexes for multiple tables separated out onto different databases
- sharding (the last step)
Once you hit the magical 300-500 GB size and master-slave replication scaling has taken you as far as it can, you need to move on to different techniques and technology. This is where HBase picks up. So all the things you list below as 'negatives' are the reality on the ground when you scale, no matter what technology you use. If they sound too ugly for you, perhaps you really need MySQL?

On Fri, Jul 3, 2009 at 12:37 AM, tim robertson<[email protected]> wrote:
> Those 2 tables could be collapsed into 1 table with 2 columns of course...
>
> On Fri, Jul 3, 2009 at 9:24 AM, tim robertson<[email protected]> wrote:
>> Hi,
>>
>> Disclaimer: I am a newbie, so this is just one option, and I am basing it
>> on my understanding that secondary indexes are not yet working on
>> HBase...
>>
>> Since HBase has very fast "get by primary key" but is *still* (?)
>> without working secondary indexes, you would need to do scans to find
>> the records. A workaround would be to have 2 more tables,
>> "Country_Contains" and "Country_Contained_In"; in each table the
>> primary key is the unique ID of the country, and the payload is the
>> keys of the rows in the main table. Basically this is creating 2
>> tables to act as the index manually. It duplicates data and
>> requires managing 3 tables wrapped in a transaction when doing
>> CRUD, but it allows lookup of the rows to modify without any
>> need for scanning.
>>
>> Just one idea...
>>
>> Cheers,
>>
>> Tim
>>
>> On Fri, Jul 3, 2009 at 9:10 AM, Ishaaq Chandy<[email protected]> wrote:
>>> Hi all,
>>> I am pretty new to HBase, so forgive me if this seems like a silly question.
>>>
>>> Each row in my HBase table is a geographical location that is related to
>>> other locations. For example, one relationship is the CONTAIN relationship. So,
>>> Europe CONTAINs England, France, Spain etc. There is an inverse
>>> relationship as well called PARENT, so England has a PARENT called Europe.
>>> However, note that, for various business reasons not pertinent to this
>>> discussion, the inverse relationship need not always be set, i.e. we may not
>>> store France with a PARENT value of Europe, even though Europe CONTAINs
>>> France.
>>>
>>> So, I store each location as a row with an id and the payload data for that
>>> location as a separate data column. This data column includes the sets of
>>> ids of the related locations.
>>>
>>> Now, I want to be able to update/delete locations consistently. So, in my
>>> example above, I might want to delete France, in which case I also want to
>>> make sure that I delete the CONTAINs relationship that Europe has with
>>> France, as that is now obsolete. What is the most efficient way to do this? I
>>> want to minimise the number of writes I would have to do - on the other hand,
>>> optimising read performance is more important, as writes do not happen that
>>> often (this is geographic data, after all).
>>>
>>> My thoughts are: I will have to do 1+n writes to do a delete - i.e.
>>> 1 write operation to delete France and n write operations to delete the
>>> relationships that n other locations may have to France. In the case of a
>>> root location like Europe, which may have a large number of locations that
>>> relate to it, this may be expensive, but I see no other way.
>>>
>>> So, I was wondering, how do I index this to speed it up as far as
>>> possible? Given the location Europe, what are the fields I should
>>> include in its row and how do I index them? I could create a column family for
>>> each relationship type with a label, the label being the id of the location
>>> this location is related to; so, e.g., the Europe row would have a
>>> column called CONTAIN:England (assuming "England" is the id for the England
>>> location - in reality it would be a UUID). I would then have as many labels
>>> under the CONTAIN family for Europe as locations that Europe contains.
>>>
>>> How would I index this and ensure that when deleting France the query "list
>>> all locations that CONTAIN France" returns Europe (and whatever else)
>>> as quickly as possible?
>>>
>>> Thanks,
>>> Ishaaq
>>>
>>
>
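Putting Tim's reverse-index workaround together with Ishaaq's 1+n delete, here is a rough sketch, again with the newer Java client API and invented names: a main "locations" table with a CONTAIN family, plus a "contained_in" index table whose row key is a location id and whose column qualifiers (in a hypothetical "p" family) are the ids of the locations that contain it. "List all locations that CONTAIN France" is then a single Get on the index table, and deleting France turns into the 1 + n writes described above:

import java.io.IOException;
import java.util.NavigableMap;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteLocationExample {
  private static final byte[] CONTAIN = Bytes.toBytes("CONTAIN"); // family on the main table
  private static final byte[] PARENTS = Bytes.toBytes("p");       // family on the index table

  /** Deletes a location and the CONTAIN: pointers that other rows hold to it. */
  static void deleteLocation(Connection conn, String locationId) throws IOException {
    try (Table locations = conn.getTable(TableName.valueOf("locations"));
         Table containedIn = conn.getTable(TableName.valueOf("contained_in"))) {

      byte[] id = Bytes.toBytes(locationId);

      // "list all locations that CONTAIN <locationId>" is one Get on the
      // reverse-index table: the row key is the location, the column
      // qualifiers are the ids of the locations that contain it.
      Result containers = containedIn.get(new Get(id));
      NavigableMap<byte[], byte[]> parents = containers.getFamilyMap(PARENTS);

      // 1 write: delete the location's own row in the main table.
      locations.delete(new Delete(id));

      // n writes: remove the CONTAIN:<locationId> column from each container.
      if (parents != null) {
        for (byte[] parentId : parents.keySet()) {
          Delete d = new Delete(parentId);
          d.addColumns(CONTAIN, id);   // drop all versions of CONTAIN:<locationId>
          locations.delete(d);
        }
      }

      // Finally drop the now-stale index row.
      containedIn.delete(new Delete(id));
    }
  }
}

On insert the application would dual-write the same information, adding CONTAIN:France to Europe's row and a column for Europe to France's row in contained_in. Since there are no cross-table transactions, a failure part-way through leaves dangling pointers that the application has to tolerate or repair later.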
