On Tue, Aug 3, 2010 at 10:42 AM, Brian Bockelman <[email protected]> wrote:
>
> On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote:
> <snip/>
>
> All of that said, what you're protecting against here is permanent loss of a
> data center and human error. Disk, rack, and node level failures are already
> handled by HDFS when properly configured.
>
> You've forgotten a third cause of loss: undiscovered software bugs.
> The downside of spinning disks is one completely fatal bug can destroy all
> your data in about a minute (at my site, I famously deleted about 100TB in
> 10 minutes with a scratch-space cleanup script gone awry. That was one
> nasty bug). This is why we keep good backups.
> If you're very, very serious about archiving and have a huge budget, you
> would invest a few million into a tape silo at multiple sites, flip the
> write-protection tab on the tapes, eject them, and send them off to secure
> facilities. This isn't for everyone though :)
> Brian
Since HDFS filesystems are usually very large, backing them up is a challenge in itself. This is a financial issue as well as a technical one. A standard DataNode/TaskTracker might have hardware like this:

8 x 1TB disks
4 x quad-core CPUs
32 GB RAM

Assuming you are taking the distcp approach, you can mirror your cluster with some scripting/coding (see the sketch below). However, your destination systems can be more modest, assuming you wish to use them ONLY for data and no job processing:

8 x 2TB disks
1 x dual-core CPU (AMD for low power consumption)
2 GB RAM (if you can even find that little RAM on a server-class machine)
single power supply (and whatever else you can strip off to save $)
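For what it's worth, here is a rough sketch of what that distcp scripting could look like: just a small Python wrapper around the stock "hadoop distcp" command, nothing official. The NameNode hosts (nn-prod, nn-backup), the port, and the /data path are placeholders I made up; substitute your own.

#!/usr/bin/env python
# Sketch of a nightly HDFS mirror job driven by distcp.
# nn-prod, nn-backup and /data are placeholders -- replace them with
# your own NameNode hosts and the directory tree you want copied.

import subprocess
import sys

SRC = "hdfs://nn-prod:8020/data"
DST = "hdfs://nn-backup:8020/data"

def mirror():
    # -update only copies files that are missing or differ on the
    # destination, so repeated runs behave like an incremental mirror.
    cmd = ["hadoop", "distcp", "-update", SRC, DST]
    rc = subprocess.call(cmd)
    if rc != 0:
        sys.stderr.write("distcp failed with exit code %d\n" % rc)
    return rc

if __name__ == "__main__":
    sys.exit(mirror())

Cron that nightly on a box whose Hadoop client config can reach both clusters and you get a crude mirror. Note it is only as safe as your scripting: if you also pass -delete to keep the copy exact, a mass deletion on the source propagates to the mirror on the next run, which is exactly the failure mode Brian describes.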
