Hello, Gray-san

Thank you. Your explanation was helpful.

Regards
Takayuki


----- Original Message ----- 
From: "Jonathan Gray" <jg...@facebook.com>
To: <hbase-user@hadoop.apache.org>
Sent: Saturday, May 08, 2010 1:54 AM
Subject: RE: How is column timestamp useful?


I would argue that the primary reasons for versioning have nothing to
do with "rescuing users" or being able to recover data.

To reiterate what others have said, HBase/BigTable is versioned
because of the immutable nature of data (an update is a newer version
on top of the old version, not actually an update) and the original
web crawling use case where they wanted to keep historical
information.

As you say, it is certainly possible to model most timestamp-based
schemas without using the built-in versioning (by adding it to the row
key or to the column qualifiers).
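
As a rough illustration of pushing the timestamp into the row key, here
is a minimal sketch in Python. The key layout (site, a "/" separator,
then a zero-padded reversed timestamp) and the helper name
crawl_row_key are assumptions made up for this example, not anything
HBase prescribes. Reversing the timestamp is one common way to make the
newest crawl sort first under lexicographic ordering.

```python
# Hypothetical row-key layout: "<site>/<reversed timestamp>".
# Nothing here is an HBase API; it only shows how the keys would sort.
MAX_LONG = 2**63 - 1  # largest signed 64-bit value, 19 decimal digits


def crawl_row_key(site: str, crawl_ts_ms: int) -> bytes:
    """Build a row key whose byte order puts the newest crawl first."""
    reversed_ts = MAX_LONG - crawl_ts_ms
    # Zero-pad to 19 digits so string order matches numeric order.
    return f"{site}/{reversed_ts:019d}".encode("utf-8")


# Keys sorted as bytes, the way rows are ordered in a sorted store.
keys = sorted(crawl_row_key("example.com", ts) for ts in (1000, 3000, 2000))
newest_key = keys[0]
```

With this layout, a scan starting at the site prefix would see the most
recent crawl first, but nothing trims old crawls for you.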

But to revisit the crawl example again, imagine our requirement is
that we want to keep the last 10 crawls of every site.  If I were
storing each crawl in a row that included the stamp of the crawl, I
would need my own background process to garbage collect any crawl that
was not one of the 10 most recent.  By using the integrated version
limits, I can set maxVersions to 10, and HBase will automatically
garbage collect, as a background process, any old crawls beyond the
threshold I set.
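
To make the version-limit behaviour concrete, here is a toy in-memory
model in Python. It is only a sketch of the idea, not the HBase client
API: the class VersionedStore, its methods, and the "content:html"
column name are all invented for this example. Reads return at most
max_versions of the newest versions, and a separate compact() step
physically drops the older ones, loosely mirroring the way the cleanup
happens in the background.

```python
from collections import defaultdict


class VersionedStore:
    """Toy model of a versioned cell store (hypothetical, not HBase)."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        # A write never overwrites in place; it adds a new version.
        self._cells[(row, column)][timestamp] = value

    def get_versions(self, row, column):
        # Newest first, capped at max_versions even before compaction.
        versions = self._cells.get((row, column), {})
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[: self.max_versions]

    def compact(self):
        # Physically remove versions beyond the configured limit.
        for versions in self._cells.values():
            for ts in sorted(versions, reverse=True)[self.max_versions:]:
                del versions[ts]


store = VersionedStore(max_versions=10)
for crawl_number in range(1, 13):  # 12 crawls of the same site
    store.put("example.com", "content:html", crawl_number,
              f"crawl {crawl_number}")
store.compact()
kept = [ts for ts, _ in store.get_versions("example.com", "content:html")]
```

After the twelve writes and one compaction, only crawls 3 through 12
remain, without any application-side cleanup job.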

As far as pushing timestamps into rows in order to avoid large rows,
this is a fair point, but remember it is the goal of HBase to support
rows with millions of columns and versions (if you are considering
billions of versions in one row, then perhaps this is no longer a sane
use of the integrated versioning).  While this row cannot be split
across two regionservers, often this is acceptable or even desirable.
For example, if my row is a userid, I may want a given user to only
live on a single machine rather than be spread across multiple
machines.  Among other reasons, this provides better overall
availability for users, as a single machine failure only impacts the
users who live on that machine (if each user were spread across
machines, the failure of any one machine would impact a much larger
percentage of users).
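
The availability argument can be checked with some quick arithmetic.
The Python sketch below assumes a simple round-robin placement and
counts a user as affected if any machine holding part of their data
fails; the function name, the cluster size, and the user counts are
illustrative assumptions, not figures from a real deployment.

```python
def fraction_affected(num_users, num_machines, machines_per_user,
                      failed_machine=0):
    """Fraction of users touched when one machine fails.

    Assumes user i has data on machines i, i+1, ... (mod num_machines),
    and that losing any one of them affects the user.
    """
    affected = 0
    for user in range(num_users):
        hosts = {(user + offset) % num_machines
                 for offset in range(machines_per_user)}
        if failed_machine in hosts:
            affected += 1
    return affected / num_users


# 1000 users on 10 machines.
single = fraction_affected(1000, 10, machines_per_user=1)  # one machine each
spread = fraction_affected(1000, 10, machines_per_user=3)  # three machines each
```

Under these assumptions, keeping each user on one machine means a
single failure touches about a tenth of the users, while spreading each
user over three machines roughly triples that share.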

Hope that helps.

JG

