Hi,

The thing to remember is that HDFS provides the machine and disk
failure tolerance.  So keeping 3 copies of the data is not a bad thing,
and it is not really avoidable: even standard RAID systems duplicate
data.  Unless you believe a disk will never fail, you are duplicating
data at some level.

As for shipping with a default version count above 1, many of our users
like this ability, and I think it highlights one of the ways the unique
storage structure can be used to do things that are not possible in
other data stores.

Also, a version count of 1 doesn't mean you will only ever have 1
version stored - it just means the maximum number of versions returned
will be capped at 1, and during daily major compactions (or by admin
action) we will only retain 1 version.  So if you quickly write many
versions, you will end up with that many versions on disk until the
next major compaction, but they will be hidden.
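
To make the difference between "stored" and "returned" concrete, here is
a rough sketch in Python. It is a toy model, not the HBase API: the Cell
class and its put/get/major_compact methods are names I made up for
illustration, and real compaction works on store files rather than on a
list in memory.

```python
class Cell:
    """Toy model of one versioned cell; hypothetical, not the HBase API."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.stored = []  # (timestamp, value) pairs, in write order

    def put(self, timestamp, value):
        # Writes are only appended; nothing is discarded at write time.
        self.stored.append((timestamp, value))

    def get(self, versions=None):
        # Reads return newest first and never exceed max_versions.
        if versions is None:
            limit = self.max_versions
        else:
            limit = min(versions, self.max_versions)
        newest_first = sorted(self.stored, key=lambda tv: tv[0], reverse=True)
        return newest_first[:limit]

    def major_compact(self):
        # Compaction is the point where excess versions are actually dropped.
        self.stored = self.get()


cell = Cell(max_versions=1)
for ts in (1, 2, 3):
    cell.put(ts, f"v{ts}")

print(len(cell.stored))  # 3 versions are still stored
print(cell.get())        # [(3, 'v3')], the older two are hidden
cell.major_compact()
print(len(cell.stored))  # 1 after compaction
```

The point of the sketch is only that the cap applies at read time and at
compaction time, not at write time.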

The API by default only returns the most recent version. When you put
a value without specifying a timestamp, it is given the current time as
its timestamp. If you do nothing special, you will never notice that
there are multiple versions.
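
As a rough illustration of that default behaviour, here is a small
Python sketch. Again this is a toy, not the HBase client: the Table
class, its method names, and the "row1/cf:col" key format are all
assumptions for the example, and a simple counter stands in for the
server clock so the output is deterministic.

```python
import itertools


class Table:
    """Toy versioned table; a sketch, not the HBase client API."""

    def __init__(self):
        self._clock = itertools.count(1)  # stands in for the server's current time
        self._cells = {}                  # key -> {timestamp: value}

    def put(self, key, value, timestamp=None):
        # With no explicit timestamp, the write is stamped with the latest time.
        if timestamp is None:
            timestamp = next(self._clock)
        self._cells.setdefault(key, {})[timestamp] = value

    def get(self, key, max_versions=1):
        # Default read: only the newest version, so the caller sees what
        # looks like a plain key/value store.
        versions = self._cells.get(key, {})
        return sorted(versions.items(), reverse=True)[:max_versions]


table = Table()
table.put("row1/cf:col", "old")
table.put("row1/cf:col", "new")

print(table.get("row1/cf:col"))                  # [(2, 'new')]
print(table.get("row1/cf:col", max_versions=2))  # [(2, 'new'), (1, 'old')]
```

Only a caller that explicitly asks for more than one version ever sees
the older value.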



On Thu, May 6, 2010 at 10:19 PM, Takayuki Tsunakawa
<tsunakawa.ta...@jp.fujitsu.com> wrote:
> Hello, Kevin-san
>
> Yes, Hadoop DFS maintains three copies of the same data (version) at
> the file system level. What I'm wondering about is the necessity of
> different versions of cells by HBase at the database level.
> Amazon SimpleDB, Microsoft Azure Table, and Google App Engine
> Datastore do not provide versioning. So I felt that many people do not
> have to use versioning and the default maximum versions of HBase had
> better be 1.
>
> Regards
> Takayuki
>
>
> ----- Original Message -----
> From: "Kevin Apte" <technicalarchitect2...@gmail.com>
> To: <hbase-user@hadoop.apache.org>
> Sent: Friday, May 07, 2010 1:51 PM
> Subject: Re: How is column timestamp useful?
>
>
>> Hadoop philosophy is to deploy on low cost disks and keep 3 copies of
>> data for redundancy. This ensures that the costs are very low- perhaps
>> 5 to 10 times lower than what large Enterprises are paying for
>> expensive SAN configurations.
>>
>> This does not mean one needs to waste storage-  If you store files
>> compressed using gZip, multiple versions of a row may compress very
>> well.
>>
>> Kevin
>>
>>
>>
>> On Fri, May 7, 2010 at 10:14 AM, tsuna <tsuna...@gmail.com> wrote:
>>
>>> In addition to what Ryan said, even if the default maximum number of
>>> versions for a cell is 3 doesn't mean that you end up wasting space.
>>> If you only ever write one version, that's what you end up paying
>>> for.
>>>
>>> --
>>> Benoit "tsuna" Sigoure
>>> Software Engineer @ www.StumbleUpon.com
>>>
>>
>
>
>
