On Tue, Aug 3, 2010 at 10:42 AM, Brian Bockelman <[email protected]> wrote:
>
> On Aug 3, 2010, at 9:12 AM, Eric Sammer wrote:
> <snip/>
>
> All of that said, what you're protecting against here is permanent loss of a
> data center and human error. Disk, rack, and node level failures are already
> handled by HDFS when properly configured.
>
> You've forgotten a third cause of loss: undiscovered software bugs.
> The downside of spinning disks is one completely fatal bug can destroy all
> your data in about a minute (at my site, I famously deleted about 100TB in
> 10 minutes with a scratch-space cleanup script gone awry. That was one
> nasty bug). This is why we keep good backups.
> If you're very, very serious about archiving and have a huge budget, you
> would invest a few million into a tape silo at multiple sites, flip the
> write-protection tab on the tapes, eject them, and send them off to secure
> facilities. This isn't for everyone though :)
> Brian
Since HDFS filesystems are usually very large, backing them up is a challenge in itself. This is a financial issue as well as a technical one. A standard DataNode/TaskTracker might have hardware like this:

8 x 1TB disks
4 x quad-core CPUs
32 GB RAM

Assuming you are taking the distcp approach, you can mirror your cluster with some scripting/coding (see the sketch below). However, your destination systems can be more modest, assuming you wish to use them ONLY for data and no job processing:

8 x 2TB disks
1 x dual-core CPU (AMD for low power consumption)
2 GB RAM (if you can even find that little RAM on a server-class machine)
single power supply (and whatever else you can strip off to save $)
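For what it's worth, here is a rough sketch of what that distcp scripting could look like: just a small Python wrapper around the stock "hadoop distcp" command, nothing official. The NameNode hosts (nn-prod, nn-backup), the port, and the /data path are placeholders I made up; substitute your own.

#!/usr/bin/env python
# Sketch of a nightly HDFS mirror job driven by distcp.
# nn-prod, nn-backup and /data are placeholders -- replace them with
# your own NameNode hosts and the directory tree you want copied.

import subprocess
import sys

SRC = "hdfs://nn-prod:8020/data"
DST = "hdfs://nn-backup:8020/data"

def mirror():
    # -update only copies files that are missing or differ on the
    # destination, so repeated runs behave like an incremental mirror.
    cmd = ["hadoop", "distcp", "-update", SRC, DST]
    rc = subprocess.call(cmd)
    if rc != 0:
        sys.stderr.write("distcp failed with exit code %d\n" % rc)
    return rc

if __name__ == "__main__":
    sys.exit(mirror())

Cron that nightly on a box whose Hadoop client config can reach both clusters and you get a crude mirror. Note it is only as safe as your scripting: if you also pass -delete to keep the copy exact, a mass deletion on the source propagates to the mirror on the next run, which is exactly the failure mode Brian describes.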
