For me it depends on the requirements for the data, but if I'm responsible for it and the data is deemed critical, my choice would be HDFS.

One more example: where I work we are currently looking at keeping data backed up for at least 5-7 years, and suddenly all sorts of issues come into play, like disk bit rot with offline backups. Tape? Not an option where I'm working. We're evaluating big rack units with 48 x 2TB disks and 2-core CPUs in an HDFS setup (no MapReduce) just to store the data. HDFS will take care of failing disks, and because block checksums are re-verified on a roughly three-week cycle, bit rot is dealt with as well.
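For an archive-only cluster like that, the two settings that matter most are the replication factor and the block scanner period. A minimal hdfs-site.xml sketch of what we have in mind (the values are just our starting point, not a recommendation):

    <!-- hdfs-site.xml (sketch) -->
    <configuration>
      <property>
        <!-- three replicas per block: survives the loss of any two disks or nodes -->
        <name>dfs.replication</name>
        <value>3</value>
      </property>
      <property>
        <!-- datanodes re-verify block checksums every 504 hours (~3 weeks),
             which is what catches silent bit rot -->
        <name>dfs.datanode.scan.period.hours</name>
        <value>504</value>
      </property>
    </configuration>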
On Thu, Jan 27, 2011 at 3:04 AM, <stu24m...@yahoo.com> wrote:

> I believe for most people, the answer is "Yes"
> ------------------------------
> *From:* Nathan Rutman <nrut...@gmail.com>
> *Date:* Wed, 26 Jan 2011 09:41:37 -0800
> *To:* <hdfs-user@hadoop.apache.org>
> *ReplyTo:* hdfs-user@hadoop.apache.org
> *Subject:* Re: HDFS without Hadoop: Why?
>
> Ok. Is your statement, "I use HDFS for general-purpose data storage because it does this replication well", or is it more, "the most important benefit of using HDFS as the Map-Reduce or HBase backend fs is data safety"? In other words, I'd like to relate this back to my original question of the broader usage of HDFS - does it make sense to use HDFS outside of the special application space for which it was designed?
>
> On Jan 26, 2011, at 1:59 AM, Gerrit Jansen van Vuuren wrote:
>
> Hi,
>
> For true data durability RAID is not enough. The conditions I operate under are the following:
>
> (1) Data loss is not acceptable under any terms.
> (2) Data unavailability is not acceptable under any terms for any period of time.
> (3) Data loss for certain data sets becomes a legal issue and is again not acceptable, and might lead to loss of my employment.
> (4) Having 2 nodes fail per month, on average, is to be expected for the volumes we operate, i.e. 100 to 400 nodes per cluster.
> (5) Having a data centre outage once a year is to be expected. (We've already had one this year.)
>
> A word on node failure: nodes do not just fail because of disks; any component can fail, e.g. RAM, network card, SCSI controller, CPU, etc.
>
> Now, data loss or unavailability can happen under the following conditions:
> (1) Multiple or single disk failure
> (2) Node failure (a whole U goes down)
> (3) Rack failure
> (4) Data centre failure
>
> RAID covers (1), but I do not know of any RAID setup that will cover the rest. HDFS with 3-way replication covers (1), (2), and (3), but not (4). HDFS 3-way replication combined with replication across data centres (via distcp) covers (1)-(4).
>
> The question to ask the business is: how valuable is the data in question to them? If they go RAID and only cover (1), they should be asked whether it is acceptable to have data unavailable, with the possibility of permanent data loss, at any point in time, for any amount of data, for any amount of time. If they come back to you and say yes, we accept that if a node fails we lose data or that it becomes unavailable for some period, then by all means go for RAID. If the answer is no, you need replication. Even DBAs understand this, and that's why for databases we back up, replicate and load/fail-over balance; why should we not do the same for critical business data on file storage?
>
> We run all of our nodes non-RAIDed (JBOD), because having 3 replicas means you don't need extra replicas on the same disk or node.
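(On the distcp point above: cross-data-centre replication is normally just a periodic copy between the two clusters; something along these lines, with made-up namenode hostnames and paths:

    # run on a schedule, e.g. from cron, against the two clusters' namenodes
    hadoop distcp -update hdfs://nn-dc1:8020/data/critical hdfs://nn-dc2:8020/data/critical

-update makes the copy incremental, so only new or changed files move across the WAN.)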
> Yes, it's true that any distributed file system will make data available to any number of nodes, but this was not my point earlier. Having data replicas on multiple nodes means that data can be worked on in parallel on multiple physical nodes, without needing to read/copy the data from a single node.
>
> Cheers,
> Gerrit
>
> On Wed, Jan 26, 2011 at 5:54 AM, Dhruba Borthakur <dhr...@gmail.com> wrote:
>
>> Hi Nathan,
>>
>> we are using HDFS-RAID for our 30 PB cluster. Most datasets have a replication factor of 2.2 and a few datasets have a replication factor of 1.4. Some details here:
>>
>> http://wiki.apache.org/hadoop/HDFS-RAID
>> http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
>>
>> thanks,
>> dhruba
>>
>> On Tue, Jan 25, 2011 at 7:58 PM, <stu24m...@yahoo.com> wrote:
>>
>>> My point was it's not RAID or whatever versus HDFS. HDFS is a distributed file system that solves different problems.
>>>
>>> HDFS is a file system. It's like asking NTFS or RAID?
>>>
>>> > but can be generally dealt with using hardware and software failover techniques.
>>>
>>> Like hdfs.
>>>
>>> Best,
>>> -stu
>>> -----Original Message-----
>>> From: Nathan Rutman <nrut...@gmail.com>
>>> Date: Tue, 25 Jan 2011 17:31:25
>>> To: <hdfs-user@hadoop.apache.org>
>>> Reply-To: hdfs-user@hadoop.apache.org
>>> Subject: Re: HDFS without Hadoop: Why?
>>>
>>> On Jan 25, 2011, at 5:08 PM, stu24m...@yahoo.com wrote:
>>>
>>> > I don't think, as a recovery strategy, RAID scales to large amounts of data. Even as some kind of attached storage device (e.g. Vtrack), you're only talking about a few terabytes of data, and it doesn't tolerate node failure.
>>>
>>> When talking about large amounts of data, 3x redundancy absolutely doesn't scale. Nobody is going to pay for 3 PB worth of disk if they only need 1 PB worth of data. This is where dedicated high-end RAID systems come in (this is in fact what my company, Xyratex, builds): redundant controllers, battery backup, etc. The incremental cost for an additional drive in such systems is negligible.
>>>
>>> > A key part of hdfs is the distributed part.
>>>
>>> Granted, single-point-of-failure arguments are valid when concentrating all the storage together, but they can generally be dealt with using hardware and software failover techniques.
>>>
>>> The scale argument in my mind is exactly reversed -- HDFS works fine for smaller installations that can't afford RAID hardware overhead and access redundancy, and where buying 30 drives instead of 10 is an acceptable cost for the simplicity of HDFS setup.
>>>
>>> > Best,
>>> > -stu
>>> > -----Original Message-----
>>> > From: Nathan Rutman <nrut...@gmail.com>
>>> > Date: Tue, 25 Jan 2011 16:32:07
>>> > To: <hdfs-user@hadoop.apache.org>
>>> > Reply-To: hdfs-user@hadoop.apache.org
>>> > Subject: Re: HDFS without Hadoop: Why?
>>> >
>>> > On Jan 25, 2011, at 3:56 PM, Gerrit Jansen van Vuuren wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> Why would 3x data seem wasteful? This is exactly what you want. I would never store any serious business data without some form of replication.
>>> >
>>> > I agree that you want data backup, but 3x replication is the least efficient / most expensive (space-wise) way to do it. This is what RAID was invented for: RAID 6 gives you fault tolerance against the loss of any two drives, for only 20% disk space overhead. (Sorry, I see I forgot to note this in my original email, but that's what I had in mind.) RAID is also not necessarily expensive in dollar terms; Linux MD RAID is free and effective.
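(To put numbers on the 20% figure: in a common 8+2 RAID-6 layout, 2 of every 10 drives hold parity, so 10 TB of raw disk gives 8 TB usable, versus roughly a third usable with 3x replication. A software RAID-6 array really is a one-liner with mdadm; the device names below are made up:

    # 10-drive RAID-6 (8 data + 2 parity), hypothetical disks /dev/sdb ... /dev/sdk
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]

Of course, as Gerrit points out, this protects against disk failure only, not node, rack or data-centre failure.)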
>>> >
>>> >> What happens if you store a single file on a single server without replicas and that server goes down, or just the disk that the file is on goes? HDFS and any decent distributed file system uses replication to prevent data loss. As a side effect, having replicas of the same piece of data on separate servers means that more than one task can work on it in parallel.
>>> >
>>> > Indeed, replicated data does mean Hadoop could work on the same block on separate nodes. But outside of Hadoop compute jobs, I don't think this is useful in general. And in any case, a distributed filesystem would let you work on the same block of data from however many nodes you wanted.
>>>
>>
>> --
>> Connect to me at http://www.facebook.com/dhruba
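(If anyone wants to see the replica placement being debated here, fsck will print which datanodes hold each block of a file; the path below is just an example:

    hadoop fsck /data/critical/somefile -files -blocks -locations

The -locations flag is what lists the datanode addresses per block.)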