I would argue that the primary reasons for versioning have nothing to do with 
"rescuing users" or being able to recover data.

To reiterate what others have said, HBase/BigTable is versioned because of 
the immutable nature of its data (an update is a newer version written on top 
of the old version, not an in-place update) and because of the original web 
crawling use case, where they wanted to keep historical information.

As you say, it is certainly possible to model most timestamp-based schemas 
without using the built-in versioning (by adding it to the row key or to the 
column qualifiers).
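
As a concrete sketch of the row-key approach, here is one way to embed a 
crawl timestamp in the row key (the "reversed domain + reversed timestamp" 
layout and the helper below are hypothetical illustrations, not something 
from this thread):

```java
// Sketch: modeling the crawl timestamp in the row key instead of relying
// on built-in cell versions. The key layout here is a common convention,
// not an HBase requirement.
public class RowKeyDesign {

    // Reverse the timestamp (Long.MAX_VALUE - ts) so that the newest crawl
    // sorts FIRST under HBase's lexicographic row ordering, making "latest
    // crawl" a cheap scan from the start of the site's key range.
    static String crawlRowKey(String reversedDomain, long crawlTimeMs) {
        long reversed = Long.MAX_VALUE - crawlTimeMs;
        // Zero-pad to 19 digits so string order matches numeric order.
        return String.format("%s/%019d", reversedDomain, reversed);
    }

    public static void main(String[] args) {
        String older = crawlRowKey("com.example.www", 1000L);
        String newer = crawlRowKey("com.example.www", 2000L);
        // The newer crawl's key sorts lexicographically before the older one's.
        System.out.println(newer.compareTo(older) < 0);
    }
}
```

The trade-off, as discussed below, is that with this layout you must clean 
up old crawls yourself.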

But to revisit the crawl example, imagine our requirement is that we want to 
keep the last 10 crawls of every site.  If I were storing each crawl in a row 
that included the timestamp of the crawl, I would need my own background 
process to garbage collect any crawl that was not one of the 10 most recent.  
By using the integrated version limit, I can set maxVersions to 10, and HBase 
will automatically garbage collect old crawls beyond that threshold as a 
background process.
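
For illustration, a minimal sketch of creating such a table with the 
2010-era HBase Java client API (the table name "crawls" and family name 
"content" are hypothetical):

```java
// Sketch: create a table whose "content" family keeps at most 10 versions
// per cell. Older crawls beyond that limit are garbage-collected by HBase
// during compactions; no user-written cleanup process is needed.
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateCrawlTable {
    public static void main(String[] args) throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();

        // Keep the 10 most recent versions of each cell.
        HColumnDescriptor content = new HColumnDescriptor("content");
        content.setMaxVersions(10);

        HTableDescriptor crawls = new HTableDescriptor("crawls");
        crawls.addFamily(content);

        new HBaseAdmin(conf).createTable(crawls);
    }
}
```

Each crawl is then written as a new version of the same cell, and reads 
default to the latest version.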

As far as pushing timestamps into rows in order to avoid large rows, this is a 
fair point, but remember that it is a goal of HBase to support rows with 
millions of columns and versions (if you are considering billions of versions 
in one row, then perhaps that is no longer a sane use of the integrated 
versioning).  While such a row cannot be split across two regionservers, often 
this is okay or even desirable.  For example, if my row key is a userid, I may 
want a given user to live on only a single machine rather than be spread 
across multiple machines.  Among other reasons, this provides better overall 
availability for users, as a single machine failure only impacts the users who 
live on that machine (if each user were spread across machines, the 
availability of each machine would impact a much larger percentage of users).

Hope that helps.

JG

> -----Original Message-----
> From: Takayuki Tsunakawa [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Sent: Friday, May 07, 2010 12:04 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: How is column timestamp useful?
> 
> All,
> 
> Thank you for giving lots of opinions and information. I'll try to
> persuade my colleagues as follows:
> 
> I couldn't find any good examples where versioning should be
> definitely utilized. However, HBase community members gave me the idea
> on how versioning is useful.
> 1. Recover data lost by accidental deletions or updates
>    (I think this is the most persuasive reason)
> 2. Auditing (change tracking) for compliance
>    However, this is not persuasive, because advanced RDBMSs provide
> audit trails, not versioning. Versioning itself does not show who
> changed the data or how.
> 3. Recording events (as in Google's personalized search)
>    This is not persuasive either. As I wrote in the previous mail,
> embedding the time of the event in the row key may be better because
> it prevents the rows from becoming big.
> 
> If versioning is not necessary from your requirement, you can ignore
> timestamps (do not have to specify timestamp in API call).
> Although HBase keeps three versions by default, which may be a bit
> wasteful of memory and disk, turning on compression for column
> families can reduce the waste to a negligible level (is that
> true?).
> If saving memory (=keep memtable as small as possible) is important,
> you can set the maximum number of versions to 1.
> The reason that the default is 3 is to rescue users from their
> mistakes.
> (Without versioning, if users accidentally delete or update data, you
> have to develop a tool that restores the previous records.)
> 
> Regards
> Takayuki
> 
