I don't think RAID, as a recovery strategy, scales to large amounts of data.
Even as some kind of attached storage device (e.g. Vtrack), you're only talking
about a few terabytes of data, and it doesn't tolerate node failure.

A key part of HDFS is the distributed part.
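
A rough way to see why: with 3x replication spread across many nodes, a block
is lost only if all three nodes holding its replicas fail at the same time,
which gets very unlikely as the cluster grows. A back-of-the-envelope sketch
(the cluster size and failure count are made-up illustration numbers, nothing
HDFS-specific):

    from math import comb

    def p_block_lost(nodes, failed, replicas=3):
        # A block is lost only if every one of its replica nodes is
        # among the failed ones (replicas assumed on distinct nodes).
        if failed < replicas:
            return 0.0
        return comb(failed, replicas) / comb(nodes, replicas)

    # e.g. 3 of 100 nodes die at once: ~6e-6 chance for any given block
    print(p_block_lost(100, 3))

A single RAID array, however wide, is still one box; lose the box and you
lose everything in it.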

Best,
 -stu
-----Original Message-----
From: Nathan Rutman <nrut...@gmail.com>
Date: Tue, 25 Jan 2011 16:32:07 
To: <hdfs-user@hadoop.apache.org>
Reply-To: hdfs-user@hadoop.apache.org
Subject: Re: HDFS without Hadoop: Why?


On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> Why would 3x data seem wasteful? 
> This is exactly what you want.  I would never store any serious business data 
> without some form of replication.

I agree that you want data backup, but 3x replication is the least efficient /
most expensive (space-wise) way to do it.  This is what RAID was invented for:
RAID 6 gives you fault tolerance against the loss of any two drives, for only
20% disk-space overhead in a 10-drive group.  (Sorry, I see I forgot to note
this in my original email, but that's what I had in mind.)  RAID is not
necessarily expensive in dollars either; Linux MD RAID is free and effective.
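
For concreteness, here's the space arithmetic (a quick sketch; the 10-drive
RAID 6 group is just an assumed example size):

    def overhead(raw_bytes_per_usable_byte):
        # Overhead = extra raw storage beyond the usable data itself.
        return raw_bytes_per_usable_byte - 1.0

    # 3x replication: 3 raw bytes stored per usable byte -> 200% overhead
    print(overhead(3.0))

    # RAID 6, 10 drives (8 data + 2 parity): 10/8 raw per usable -> 25% overhead
    print(overhead(10.0 / 8.0))

(Whether you count the RAID 6 case as 20% of the raw drives or 25% on top of
the data, it's a small fraction of what replication costs.)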

> What happens if you store a single file on a single server without replicas 
> and that server goes, or just the disk that the file is on goes? HDFS, like 
> any decent distributed file system, uses replication to prevent data loss. 
> As a side effect, having replicas of the same piece of data on separate 
> servers means that more than one task can work on the data in parallel.

Indeed, replicated data does mean Hadoop could work on the same block on 
separate nodes.  But outside of Hadoop compute jobs, I don't think this is 
useful in general.  And in any case, a distributed filesystem would let you 
work on the same block of data from however many nodes you wanted.

