Hi folks, I'm still a noob in the hadoop world, so I apologize if this has already been asked and answered. This thread seems pretty recent, so hopefully it's OK if I jump in. I trust folks to politely correct me if I'm way off base. (This is not really a question per se, but more a request for comments/feedback.)
This question from M.Shiva spawned a discussion about the difference between replication and backup/restore.

> On 12/19/07 11:17 PM, "M.Shiva" wrote:
>> 5. Can we take backup and restore the files written to hadoop
>
> On 12/20/07 12:05 AM, "Ted Dunning" wrote:
> Obviously, yes.
>
> But, again, the point of hadoop's file system is that it makes this largely
> unnecessary because of file replication.

From: Joydeep Sen Sarma, Thu, 20 Dec 2007 08:53:38 -0800

> agreed - i think for anyone who is thinking of using hadoop as a place from
> where data is served - has to be disturbed by lack of data protection.
>
> replication in hadoop provides protection against hardware failures. not
> software failures. backups (and depending on how they are implemented -
> snapshots) protect against errant software. we have seen evidence of the
> namenode going haywire and causing block deletions/file corruptions at least
> once. we have seen more reports of the same nature on this list. i don't
> think hadoop (and hbase) can reach their full potential without a safeguard
> against software corruptions.
>
> (i don't think the traditional notion of backing up to tape (or even virtual
> tape - which is really what our filers are becoming) is worth discussing. for
> large data sets - the restore time would be so bad as to render these
> useless as a recovery path).

I think both answers are right: replication protects against most "normal" failures, and yet certain not-unheard-of events can still cause catastrophic data loss. These might include software problems, multiple simultaneous node failures, or a malicious insider deciding to rm files. I agree with Joydeep that it's a little disturbing. In its current design, Hadoop's DFS is probably not well suited to applications where losing the data would mean losing your business, or even significant revenue. Levels of data protection beyond replication and checksums probably weren't even among the original design goals; after all, it started out as a distributed computing project, right?

There are some types of files I don't care about losing, and there are others for which replication level 4 would still not be enough. After all, if I lose power to a datacenter and I'm using older disks, it wouldn't be surprising if a dozen or more disks failed to come back, and if the cluster is well balanced, *some* number of my important blocks will lose all of their replicas.

At the other end of the spectrum, there are also some files for which I'd like a very basic level of protection that doesn't cost 2x or 3x as much as no replication at all. If a GB of disk costs me $0.30 ($0.50 if you count the servers, switches, etc.), then replication level 2 costs me $1.00 and replication level 3 costs me $1.50. But what if every 5th block were a parity block that could be used to reconstruct any one of the 4 other blocks? I'd still be protected against losing 1 node, but at a total cost of about $0.60 instead of $1.00. (Some smart programmer might even write a map/reduce program to replace the failed blocks quickly :) Losing 2 or more nodes at once would still mean a chance that some block loses both a replica and its companion, or a replica and its parity, but perhaps that level of risk is acceptable for some applications.
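To make the parity idea a little more concrete, here's a rough sketch in plain Java. It's definitely not real Hadoop code; the block size, the group size of 4, and every class and method name are made up for illustration. The parity block is just the XOR of the 4 data blocks, and any single lost block can be rebuilt by XOR-ing the parity with the 3 survivors:

    // Toy illustration of the "every 5th block is parity" idea.  Plain Java,
    // not Hadoop code; block size, group size, and all names are invented.
    import java.util.Arrays;
    import java.util.Random;

    public class ParityGroupSketch {

        // Parity block = XOR of the 4 data blocks in the group.
        static byte[] computeParity(byte[][] dataBlocks) {
            byte[] parity = new byte[dataBlocks[0].length];
            for (byte[] block : dataBlocks) {
                for (int i = 0; i < parity.length; i++) {
                    parity[i] ^= block[i];
                }
            }
            return parity;
        }

        // Rebuild one lost block by XOR-ing the parity with the survivors.
        static byte[] reconstruct(byte[][] survivors, byte[] parity) {
            byte[] rebuilt = parity.clone();
            for (byte[] block : survivors) {
                for (int i = 0; i < rebuilt.length; i++) {
                    rebuilt[i] ^= block[i];
                }
            }
            return rebuilt;
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            byte[][] data = new byte[4][64];   // 4 tiny stand-in "blocks"
            for (byte[] block : data) {
                rnd.nextBytes(block);
            }
            byte[] parity = computeParity(data);

            // Pretend the node holding block 2 died; we still have 0, 1, 3 + parity.
            byte[][] survivors = { data[0], data[1], data[3] };
            byte[] rebuilt = reconstruct(survivors, parity);
            System.out.println(Arrays.equals(rebuilt, data[2]));   // prints "true"
        }
    }

Storage-wise that's 5 blocks to hold 4 blocks' worth of data, i.e. 1.25x the raw size, which is where my roughly-$0.60 figure comes from (1.25 * $0.50 = $0.625). A map/reduce job that finds groups with a missing member and re-runs the XOR is exactly the kind of thing I was imagining above.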
> this question came up a couple of days back as well. one option is switching
> over to solaris+zfs as a way of taking data snapshots. the other option is
> having two hdfs instances (ideally running different versions) and replicating
> data amongst them. both have clear downsides.

Since you mentioned ZFS, I went and looked at it today, and it definitely is all kinds of cool. ZFS is an excellent example of a robust, feature-rich filesystem, at least if it does what its documentation claims. It can take arbitrarily large numbers of disks and combine them into one huge pool of storage for large numbers of huge files, and it has checksumming built in. It also has replication (ditto blocks), snapshots, clones, and other sexy features like RAID-Z (which I'm calling "replication level 1.2"). (A lot has been written about it, but I found the Wikipedia entry most useful for a quick overview: http://en.wikipedia.org/wiki/ZFS)

The big thing I want it to have, and it doesn't, is "being distributed". It can handle many disks (theoretically billions), as long as they're all attached to the same Solaris kernel (yeah, right). Meanwhile, Hadoop *is* distributed, and is really great at moving large numbers of large blocks around among large numbers of nodes. So I'm thinking we really need to get these two together... I think they would get along famously.

I would really *love* to see Hadoop pick up some of the same features, especially snapshot/clone and parity blocks. I'm guessing it won't do so in the near future, but hopefully some other product will come along soon that does for the distributed-storage world what ZFS does for a single machine.
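In case it helps clarify why I keep coming back to snapshots, here's a toy copy-on-write sketch in plain Java. It has nothing to do with ZFS internals or with how Hadoop's namenode actually tracks blocks, and every name in it is invented. The point is just that a snapshot copies block *pointers* rather than block contents (so it's cheap), and later writes allocate new blocks instead of overwriting old ones, which is exactly what protects yesterday's data from the errant-software case Joydeep described:

    // Toy copy-on-write snapshot -- nothing to do with ZFS internals or with
    // how Hadoop's namenode tracks blocks; all names here are invented.
    import java.util.ArrayList;
    import java.util.List;

    public class CowSnapshotSketch {

        // A "file" is just an ordered list of references to immutable block contents.
        static class BlockMap {
            final List<byte[]> blocks = new ArrayList<byte[]>();
            BlockMap() {}
            BlockMap(BlockMap other) { blocks.addAll(other.blocks); } // copy pointers, not data
        }

        private final BlockMap live = new BlockMap();
        private final List<BlockMap> snapshots = new ArrayList<BlockMap>();

        // A snapshot copies only the block pointers, which is why it is cheap.
        void takeSnapshot() {
            snapshots.add(new BlockMap(live));
        }

        // A write never overwrites an old block; it stores new contents and repoints.
        void write(int blockIndex, byte[] newContents) {
            while (live.blocks.size() <= blockIndex) {
                live.blocks.add(new byte[0]);
            }
            live.blocks.set(blockIndex, newContents.clone());
        }

        byte[] readLive(int blockIndex) {
            return live.blocks.get(blockIndex);
        }

        byte[] readSnapshot(int snapId, int blockIndex) {
            return snapshots.get(snapId).blocks.get(blockIndex);
        }

        public static void main(String[] args) {
            CowSnapshotSketch fs = new CowSnapshotSketch();
            fs.write(0, "version 1".getBytes());
            fs.takeSnapshot();                                   // cheap: shares block 0
            fs.write(0, "garbage from a buggy job".getBytes());  // errant software strikes
            System.out.println(new String(fs.readSnapshot(0, 0))); // "version 1"
            System.out.println(new String(fs.readLive(0)));         // "garbage from a buggy job"
        }
    }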
