Re: Clarifying the role of HBase Versions

Jonathan Gray Tue, 02 Jun 2009 11:42:33 -0700

I don't see anything inherently wrong with your design.

On Tue, June 2, 2009 4:16 am, Ryan J. McDonough wrote:
>
> On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:
>
>
>> Ryan,
>>
>>
>> You are currently only storing the latest nickname, not all 3?  I'm
>> trying to understand your use case exactly.
>
> Yes, the multiple values are being stored, in fact far more than 3.
> We've defined the tables to use the max number of versions. We
> currently can store something to the effect of:
>
> user123=>props:nickname:1243940086:Ryan
> user123=>props:nickname:1243940087:Ryan McDonough
> user123=>props:nickname:1243940088:Some guy asking questions
> user123=>props:nickname:1243940089:Ryan
> user123=>props:nickname:1243940090:Ryan
> user123=>props:nickname:1243940091:
> user123=>props:nickname:1243940092:Ryan McDonough
>
>
> Where "props" is the column family. One thing that is challenging is
> that because the versions are keyed by timestamp, you don't have a
> mechanism to handle duplicate values, thus it's possible to have the same
> value repeated multiple times. Also, you don't have insight into whether
> or not the value was the result of an insert or an accidental dupe, or a
> deletion. Additionally, we can only evaluate a row filter against the most
> recent column value, but IIRC, that's fixed in 0.20.
>
>>
>> Whether you want to use versions or not depends on what you want to do
>> with these multiple values.
>>
>> Versions are intended for versioning, as in, multiple values for the
>> same column that are timestamped and sorted with most recent first.
>
> Yes, I understand that part. But what I'm trying to clarify is why
> store versions keyed only by timestamp and not by another arbitrary value?
> As I mentioned in my initial question, I'm starting to see
> versions as a way to provide some form of optimistic locking. To quote
> the BigTable paper:
>
> "Applications that need to avoid collisions must generate unique
> timestamps themselves. Different versions of a cell are stored in
> decreasing timestamp order, so that the most recent versions can be read
> first. To make the management of versioned data less onerous, we support
> two per-column-family settings that tell Bigtable to garbage-collect cell
> versions automatically. The client can specify either that only the last n
> versions of a cell be kept, or that only new-enough versions be kept
> (e.g., only keep values that were written in the last seven days)."
>
> With that said, I'm just trying to get some clarity on how HBase
> utilizes versions internally and if there's any chance of seeing some
> unintended consequences of using versions for something other than
> versions? For example, does having multiple versions add additional
> overhead at compaction time or when region splits occur?
>
> To put it another way: based on my current understanding of HBase
> versions, I could equate it to using an audit schema in an RDBMS to join
> multiple values. While it's possible, it's not what you'd use an audit
> schema for.
>
>> It seems from what you said that versions will work nicely.  With
>> the new API in the upcoming 0.20, there is much better support for
>> dealing with multiple versions.
>
> Yes, it does work quite nicely, however I just feel like something's
> wrong with our design. Thanks for the response.
>
> Ryan-
>
>
>>
>> JG
>>
>>
>> On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
>>
>>> I'm trying to get some clarity on the role of versions in HBase. Our
>>> table design is such that an object can have multiple property values
>>> for a given property name. For example, we could have a nickname
>>> property that a given person is known by. In the current set up, if a
>>> person has 3 nicknames, only the last one gets stored. We have
>>> considered using the column versions as an added data dimension, but
>>> that just doesn't feel quite right. Columns have a limit on how many
>>> versions they can store; granted, it's quite large, but it's still a
>>> limit nonetheless.
>>>
>>> From what I gather from reading the BigTable doc, versions could be
>>> considered a form of optimistic locking so that concurrent writes
>>> don't conflict. Is that understanding correct? If not, is using
>>> versions as an added data dimension a good idea?
>>>
>>> Ryan-
>>>
>>>
>>>
>>>
>>
>
>
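
For readers following the thread, here is a rough sketch of the versioning
behavior being discussed: versions keyed only by timestamp, read newest
first, and trimmed to a maximum count. It is a toy in-memory model in plain
Python, not the HBase API. The names VersionedCell, load_thread_example and
distinct_values are made up for illustration, and trimming on every put is a
simplification (the thread does not say when HBase discards old versions).

```python
class VersionedCell:
    """Toy model of a single cell whose versions are keyed by timestamp."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # The timestamp is the only key, so a write at an existing timestamp
        # replaces that version, while the same value at a new timestamp is
        # stored again as a duplicate.
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries (simplified trimming).
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, n=1):
        # Return up to n (timestamp, value) pairs, most recent first.
        return sorted(self._versions.items(), reverse=True)[:n]

    def distinct_values(self):
        # Client-side de-duplication, newest first. The store itself cannot
        # tell a deliberate re-insert from an accidental duplicate.
        seen = []
        for _, value in self.get(len(self._versions)):
            if value not in seen:
                seen.append(value)
        return seen


def load_thread_example(max_versions):
    # The props:nickname history for user123 quoted earlier in the thread.
    cell = VersionedCell(max_versions=max_versions)
    history = [
        (1243940086, "Ryan"),
        (1243940087, "Ryan McDonough"),
        (1243940088, "Some guy asking questions"),
        (1243940089, "Ryan"),
        (1243940090, "Ryan"),
        (1243940091, ""),
        (1243940092, "Ryan McDonough"),
    ]
    for timestamp, value in history:
        cell.put(timestamp, value)
    return cell


if __name__ == "__main__":
    print(load_thread_example(max_versions=10).distinct_values())
    print(load_thread_example(max_versions=3).get(10))
```

With a small max_versions the older nicknames silently drop out, which is
the "it's still a limit" concern raised in the original question. Using a
column per value instead of a version per value would avoid that, at the
cost of a different schema.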
