Re: Clarifying the role of HBase Versions

Ryan J. McDonough Tue, 02 Jun 2009 04:16:41 -0700


On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:

Ryan,
You are currently only storing the latest nickname, not all 3? I'm trying
to understand your use case exactly.

Yes, the multiple values are being stored, in fact far more than 3. We've defined the tables to use the max number of versions. We currently can store something to the effect of:


user123=>props:nickname:1243940086:Ryan
user123=>props:nickname:1243940087:Ryan McDonough
user123=>props:nickname:1243940088:Some guy asking questions
user123=>props:nickname:1243940089:Ryan
user123=>props:nickname:1243940090:Ryan
user123=>props:nickname:1243940091:
user123=>props:nickname:1243940092:Ryan McDonough

Where "props" is the column family. One thing that is challenging is that because the versions are keyed by timestamp, you don't have a mechanism to handle duplicate values, so it's possible to have the same value repeated multiple times. Also, you don't have insight into whether the value was the result of an insert, an accidental dupe, or a deletion. Additionally, we can only evaluate a row filter against the most recent column value, but IIRC, that's fixed in 0.20.
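
To make the duplicate problem concrete, here is a minimal client-side sketch in plain Python (not the HBase API) that works on the (timestamp, value) pairs from the listing above. The helper names are hypothetical, and treating the empty value at 1243940091 as something to skip is an assumption, since the stored versions alone don't say whether it was a deletion.

```python
from itertools import groupby

# (timestamp, value) pairs copied from the listing above.
versions = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]

def collapse_consecutive_duplicates(cells):
    """Keep the earliest cell of each run of identical consecutive values."""
    ordered = sorted(cells)  # oldest first
    return [next(run) for _, run in groupby(ordered, key=lambda cell: cell[1])]

def distinct_values(cells):
    """Distinct non-empty values, most recently written first.

    Assumption: an empty value is skipped, because we cannot tell
    whether it represents a deletion or a real (blank) nickname.
    """
    seen = []
    for _, value in sorted(cells, reverse=True):  # newest first
        if value and value not in seen:
            seen.append(value)
    return seen

print(collapse_consecutive_duplicates(versions))
print(distinct_values(versions))
```

Both helpers have to run after all versions have been read back, which is part of what feels awkward: the store itself has no notion of "same value as before".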


Whether you want to use versions or not depends on what you want to do
with these multiple values.

Versions are intended for versioning, as in, multiple values for the same
column that are timestamped and sorted with most recent first.

Yes, I understand that part. But what I'm trying to clarify is why store versions keyed only by timestamp and not by another arbitrary value? As I mentioned in my initial question, I'm starting to see versions as a way to provide some form of optimistic locking. To quote the BigTable paper:

"Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first. To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days)."

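As a rough model of the two garbage-collection settings the quoted passage describes, the sketch below applies each rule to a list of (timestamp, value) cells in plain Python. It is an illustration of the policies only, not how Bigtable or HBase implements them; the function names and the `now` and `max_age_seconds` parameters are hypothetical.

```python
# Sample cells, reusing the timestamps from the nickname listing.
cells = [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]

def keep_last_n(cells, n):
    """Policy 1: keep only the n most recent versions, newest first."""
    return sorted(cells, reverse=True)[:n]

def keep_newer_than(cells, now, max_age_seconds):
    """Policy 2: keep only versions written within max_age_seconds of now."""
    return [
        cell for cell in sorted(cells, reverse=True)
        if now - cell[0] <= max_age_seconds
    ]

print(keep_last_n(cells, 3))
print(keep_newer_than(cells, now=1243940092, max_age_seconds=2))
```

Under either rule, older nicknames silently disappear once the limit is hit, which is the "it's still a limit" concern when versions are used as a data dimension.
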
With that said, I'm just trying to get some clarity on how HBase utilizes versions internally and whether there's any chance of seeing unintended consequences from using versions for something other than versioning. For example, does having multiple versions add additional overhead at compaction time or when region splits occur?

To put it another way: Based on my current understanding of HBase versions, I could equate it to using an audit schema in an RDBMS to join multiple values. While it's possible, it's not what you'd use an audit schema for.

It seems from what you said that versions will work nicely. With the new
API in the upcoming 0.20, there is much better support dealing with
multiple versions.

Yes, it does work quite nicely, however I just feel like something's wrong with our design. Thanks for the response.


Ryan-

JG

On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
I'm trying to get some clarity on the role of versions in HBase. Our
table design is such that an object can have multiple property values for
a given property name. For example, we could have a nickname property
that a given person is known by. In the current set up, if a person has 3
nicknames, only the last one gets stored. We have considered using the
column versions as an added data dimension, but that just doesn't feel
quite right. Given that columns have a limit (granted that it's quite
large) as to how many versions they can store, it's still a limit
nonetheless.

From what I gather from reading the BigTable doc, versions could be
considered a form of optimistic locking so that concurrent writes don't
conflict. Is that understanding correct? If not, is using versions
as an added data dimension a good idea?

Ryan-
