Hello, Gray-san,

Thank you. Your explanation was helpful.

Regards,
Takayuki

----- Original Message -----
From: "Jonathan Gray" <jg...@facebook.com>
To: <hbase-user@hadoop.apache.org>
Sent: Saturday, May 08, 2010 1:54 AM
Subject: RE: How is column timestamp useful?

I would argue that the primary reasons for versioning have nothing to do with "rescuing users" or being able to recover data. To reiterate what others have said, HBase/BigTable is versioned because of the immutable nature of the data (an update is a newer version on top of the old version, not actually an update) and because of the original web crawling use case, where they wanted to keep historical information.

As you say, it is certainly possible to model most timestamp-based schemas without using the built-in versioning (by adding the timestamp to the row key or to the column qualifiers). But to revisit the crawl example again, imagine our requirement is that we want to keep the last 10 crawls of every site. If I were storing each crawl in a row that included the stamp of the crawl, I would need my own background process to garbage collect any crawl that was not one of the 10 most recent. By utilizing integrated version limits, I can set maxVersions to 10, and as a background process HBase will automatically garbage collect old crawls beyond the threshold I set.

As far as pushing timestamps into rows in order to avoid large rows, this is a fair point, but remember that it is the goal of HBase to support rows with millions of columns and versions (if you are considering billions of versions in one row, then perhaps this is no longer a sane use of the integrated versioning). While such a row cannot be split across two regionservers, often this is okay or even desirable. For example, if my row is a userid, I may want a given user to live on only a single machine rather than be spread across multiple machines.
Among other reasons, this provides better overall availability for users, as a single machine failure only impacts the users who live on that machine (if each user were spread across machines, the failure of any one machine would impact a much larger percentage of users).

Hope that helps.

JG
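
To make the crawl example above concrete, here is a minimal sketch of how a version limit behaves. It is a toy in-memory model in Python, not HBase's client API or its actual implementation; the class `VersionedCell` and its `put`, `compact`, and `get` methods are hypothetical names chosen for illustration, and `compact` merely stands in for the background cleanup described in the message.

```python
class VersionedCell:
    """Toy model of one cell that retains only the newest max_versions values."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp -> value

    def put(self, timestamp, value):
        # A write with a new timestamp adds a version on top of the old ones;
        # nothing is removed at write time.
        self._versions[timestamp] = value

    def compact(self):
        # Hypothetical stand-in for the background cleanup: keep only the
        # newest max_versions timestamps and drop everything older.
        newest = sorted(self._versions, reverse=True)[: self.max_versions]
        self._versions = {ts: self._versions[ts] for ts in newest}

    def get(self, limit=1):
        # Return up to `limit` (timestamp, value) pairs, newest first,
        # never exposing more than max_versions.
        count = min(limit, self.max_versions)
        newest = sorted(self._versions, reverse=True)[:count]
        return [(ts, self._versions[ts]) for ts in newest]


# Keep the last 10 crawls of one site, as in the example above.
cell = VersionedCell(max_versions=10)
for crawl_ts in range(1, 26):  # 25 crawls of the same site
    cell.put(crawl_ts, f"page content at crawl {crawl_ts}")

cell.compact()
retained = [ts for ts, _ in cell.get(limit=10)]
```

After the cleanup step only the ten newest crawl timestamps remain, with no application-side garbage collection process, which is the convenience the message attributes to integrated version limits.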