I don't see anything inherently wrong with your design.

On Tue, June 2, 2009 4:16 am, Ryan J. McDonough wrote:
>
> On Jun 2, 2009, at 1:31 AM, Jonathan Gray wrote:
>
>> Ryan,
>>
>> You are currently only storing the latest nickname, not all 3? I'm
>> trying to understand your use case exactly.
>
> Yes, the multiple values are being stored, in fact far more than 3.
> We've defined the tables to use the max number of versions. We
> currently can store something to the effect of:
>
> user123=>props:nickname:1243940086:Ryan
> user123=>props:nickname:1243940087:Ryan McDonough
> user123=>props:nickname:1243940088:Some guy asking questions
> user123=>props:nickname:1243940089:Ryan
> user123=>props:nickname:1243940090:Ryan
> user123=>props:nickname:1243940091:
> user123=>props:nickname:1243940092:Ryan McDonough
>
> Where "props" is the column family. One thing that is challenging is
> that because the versions are keyed by timestamp, you don't have a
> mechanism to handle duplicate values, thus it's possible to have the same
> value repeated multiple times. Also, you don't have insight into whether
> the value was the result of an insert, an accidental dupe, or a
> deletion. Additionally, we can only evaluate a row filter against the
> most recent column value, but IIRC, that's fixed in 0.20.
>
>> Whether you want to use versions or not depends on what you want to do
>> with these multiple values.
>>
>> Versions are intended for versioning, as in, multiple values for the
>> same column that are timestamped and sorted with most recent first.
>
> Yes, I understand that part. But what I'm trying to clarify is why
> store versions keyed only by timestamp and not by another arbitrary value?
> As I mentioned in my initial question, I'm starting to see
> versions as a way to provide some form of optimistic locking. To quote
> the BigTable paper:
>
> "Applications that need to avoid collisions must generate unique
> timestamps themselves. Different versions of a cell are stored in
> decreasing timestamp order, so that the most recent versions can be read
> first.
> To make the management of versioned data less onerous, we support
> two per-column-family settings that tell Bigtable to garbage-collect cell
> versions automatically. The client can specify either that only the last n
> versions of a cell be kept, or that only new-enough versions be kept
> (e.g., only keep values that were written in the last seven days)."
>
> With that said, I'm just trying to get some clarity on how HBase
> utilizes versions internally and if there's any chance of seeing some
> unintended consequences of using versions for something other than
> versions? For example, does having multiple versions add additional
> overhead at compaction time or when region splits occur?
>
> To put it another way: Based on my current understanding of HBase
> versions, I could equate it to using an audit schema in an RDBMS to join
> multiple values. While it's possible, it's not what you'd use an audit
> schema for.
>
>> It seems from what you said that versions will work nicely. With
>> the new API in the upcoming 0.20, there is much better support for
>> dealing with multiple versions.
>
> Yes, it does work quite nicely, however I just feel like something's
> wrong with our design. Thanks for the response.
>
> Ryan-
>
>>
>> JG
>>
>> On Mon, June 1, 2009 6:10 pm, Ryan J. McDonough wrote:
>>
>>> I'm trying to get some clarity on the role of versions in HBase. Our
>>> table design is such that an object can have multiple property values
>>> for a given property name. For example, we could have a nickname
>>> property that a given person is known by. In the current set up, if a
>>> person has 3 nicknames, only the last one gets stored. We have
>>> considered using the column versions as an added data dimension, but
>>> that just doesn't feel quite right. Given that columns have a limit
>>> (granted that it's quite large) as to how many versions they can store,
>>> it's still a limit nonetheless.
>>>
>>> From what I gather from reading the BigTable doc, versions could be
>>> considered a form of optimistic locking so that concurrent writes
>>> don't conflict. Is that understanding correct? If not, is using
>>> versions as an added data dimension a good idea?
>>>
>>> Ryan-
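
For anyone following the thread, the behaviour discussed above (versions keyed only by timestamp, newest first, duplicate values allowed, only the last n kept) can be sketched with a toy model in Python. This is an illustration only, not the HBase client API: the class name VersionedCell, its methods, and the choice to trim old versions eagerly on every write are all assumptions made for the example. A real store may defer that cleanup, for instance to compaction.

```python
class VersionedCell:
    """Toy model of a single cell whose versions are keyed by timestamp.

    Hypothetical illustration; none of these names come from HBase.
    """

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # The timestamp is the only version key, so a write that reuses
        # a timestamp replaces the earlier value (a "collision").
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries ("last n" policy).
        # Trimmed eagerly here for simplicity (an assumption of this model).
        for ts in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[ts]

    def get(self, n=1):
        # Return up to n (timestamp, value) pairs, most recent first.
        return sorted(self._versions.items(), reverse=True)[:n]

    def expire_older_than(self, cutoff):
        # "New enough" policy: drop versions written before the cutoff.
        for ts in [t for t in self._versions if t < cutoff]:
            del self._versions[ts]


if __name__ == "__main__":
    cell = VersionedCell(max_versions=3)
    cell.put(1243940086, "Ryan")
    cell.put(1243940087, "Ryan McDonough")
    cell.put(1243940089, "Ryan")
    cell.put(1243940090, "Ryan")
    # The oldest version is gone, and "Ryan" is stored twice because
    # nothing deduplicates by value.
    for ts, value in cell.get(3):
        print(ts, value)
```

The point of the sketch is that nothing in a timestamp-keyed scheme distinguishes a deliberate repeat from an accidental duplicate, and that values silently fall off the end once the version limit is reached.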
