On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:

Ryan,

You are currently only storing the latest nickname, not all 3? I'm trying
to understand your use case exactly.

Yes, the multiple values are being stored, in fact far more than 3. We've defined the tables to use the max number of versions. We currently can store something to the effect of:

user123=>props:nickname:1243940086:Ryan
user123=>props:nickname:1243940087:Ryan McDonough
user123=>props:nickname:1243940088:Some guy asking questions
user123=>props:nickname:1243940089:Ryan
user123=>props:nickname:1243940090:Ryan
user123=>props:nickname:1243940091:
user123=>props:nickname:1243940092:Ryan McDonough

Where "props" is the column family. One challenge is that because the versions are keyed by timestamp, there is no mechanism to handle duplicate values, so the same value can be repeated multiple times. Also, you have no insight into whether a value was the result of an insert, an accidental dupe, or a deletion. Additionally, we can only evaluate a row filter against the most recent column value, but IIRC, that's fixed in 0.20.
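
To make the duplicate problem concrete, here is a minimal Python sketch of a single versioned cell. It is a toy in-memory model only, not the HBase client API; the function names (put, read_versions, distinct_values) are made up for illustration, and the data is the nickname list above.

```python
# Toy in-memory model of one versioned cell (user123 => props:nickname).
# Illustration only: this does not use or mirror the HBase client API.

def put(cell, timestamp, value):
    """Store a value under its timestamp; a repeated timestamp overwrites."""
    cell[timestamp] = value

def read_versions(cell, max_versions=None):
    """Return (timestamp, value) pairs, most recent first."""
    ordered = sorted(cell.items(), key=lambda tv: tv[0], reverse=True)
    return ordered if max_versions is None else ordered[:max_versions]

def distinct_values(cell):
    """Collapse repeated values, keeping the most recent occurrence of each."""
    seen = set()
    result = []
    for _, value in read_versions(cell):
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

nickname = {}
for ts, value in [
    (1243940086, "Ryan"),
    (1243940087, "Ryan McDonough"),
    (1243940088, "Some guy asking questions"),
    (1243940089, "Ryan"),
    (1243940090, "Ryan"),
    (1243940091, ""),
    (1243940092, "Ryan McDonough"),
]:
    put(nickname, ts, value)

print(read_versions(nickname, max_versions=3))
print(distinct_values(nickname))
```

Because the store is keyed only by timestamp, all seven entries are kept even though just four distinct values exist, and the reader has to deduplicate. The empty value at 1243940091 also shows the ambiguity: nothing in the cell says whether it was a deletion or a deliberately empty nickname.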


Whether you want to use versions or not depends on what you want to do
with these multiple values.

Versions are intended for versioning, as in, multiple values for the same
column that are timestamped and sorted with most recent first.

Yes, I understand that part. But what I'm trying to clarify is why store versions keyed only by timestamp and not by another arbitrary value? As I mentioned in my initial question, I'm starting to see versions as a way to provide some form of optimistic locking. To quote the BigTable paper:

"Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first. To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days)."
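
As a rough illustration of the two garbage-collection settings the quote describes, here is a small Python sketch. It assumes versions are plain (timestamp, value) pairs with timestamps in seconds; keep_last_n and keep_new_enough are hypothetical helper names, and nothing here reflects how Bigtable or HBase actually implements collection.

```python
# Sketch of the two per-column-family retention policies from the quote.
# Assumption: a cell's versions are (timestamp_seconds, value) pairs.

def keep_last_n(versions, n):
    """Keep only the n most recent versions, newest first."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    return newest_first[:n]

def keep_new_enough(versions, now, max_age_seconds):
    """Keep only versions written within max_age_seconds of now, newest first."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    return [(ts, v) for ts, v in newest_first if now - ts <= max_age_seconds]

versions = [(100, "a"), (200, "b"), (300, "c"), (400, "d")]

print(keep_last_n(versions, 2))
print(keep_new_enough(versions, now=450, max_age_seconds=300))
```

Either policy silently drops older entries, which is the concern with treating versions as an extra data dimension: a value that is still logically valid can be collected just because it is old or because newer writes pushed it out.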

With that said, I'm just trying to get some clarity on how HBase utilizes versions internally and whether there's any chance of seeing unintended consequences from using versions for something other than versioning. For example, does having multiple versions add overhead at compaction time or when region splits occur?

To put it another way: Based on my current understanding of HBase versions, I could equate it to using an audit schema in an RDBMS to join multiple values. While it's possible, it's not what you'd use an audit schema for.

It seems from what you said that versions will work nicely. With the new
API in the upcoming 0.20, there is much better support dealing with
multiple versions.

Yes, it does work quite nicely, however I just feel like something's wrong with our design. Thanks for the response.

Ryan-


JG

On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
I'm trying to get some clarity on the role of versions in HBase. Our
table design is such that an object can have multiple property values for a given property name. For example, we could have a nickname property that a given person is known by. In the current setup, if a person has 3 nicknames, only the last one gets stored. We have considered using the column versions as an added data dimension, but that just doesn't feel
quite right. Given that columns have a limit (granted that it's quite
large) as to how many versions it can store, it's still a limit
nonetheless.

From what I gather from reading the BigTable doc, versions
could be considered a form of optimistic locking so that concurrent writes don't conflict. Is that understanding correct? If not, is using versions
as an added data dimension a good idea?

Ryan-




