Hello all, I am preparing an HDFS deployment on EC2 and have been considering a backup strategy as it relates to EBS volumes vs. local disk storage.
EBS is nice for its built-in replication and snapshot support, but it costs extra and consumes network bandwidth. Local instance storage is effectively free and does not use network bandwidth, but its contents are lost if the instance is terminated. I would like to minimize EBS usage to save money while still ensuring that all of my data is backed up in EBS snapshots.

What if I used a network topology script that put every node backed by an EBS volume in one 'rack' and every node using local disk in another? Since HDFS spreads replicas across racks, every block would then be guaranteed to have at least one replica persisted to EBS.

For that to work I need to keep at least 1/replicationFactor of my total raw capacity in EBS, since the EBS rack has to hold one replica of every block. So if my replication factor is 3 and the capacity of each node is equal, at least 1/3 of my nodes must be using EBS volumes. I could then back up and restore my cluster using EBS snapshots like normal.

It may happen that the EBS nodes, being only 1/3 of the total cluster capacity, will fill up faster than the local-disk nodes. I think that running the balancer regularly should ensure that enough space is maintained on the EBS volumes for new blocks, so the amount of data stored on the EBS nodes should always stay between 1/3 and 2/3 of the total capacity.

Does this sound like a reasonable setup? A couple of questions:

1) How well does Hadoop handle racks of unequal capacity and different performance characteristics?
2) Is the result of the network topology script cached by the namenode? How expensive can the script be?

Thanks!
Grant
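
P.S. To make the idea concrete, here is a rough sketch of the kind of topology script I have in mind. The ebs_nodes.txt mapping file, its location, and the rack names are just placeholders I made up; the script itself would be wired up via topology.script.file.name in core-site.xml.

#!/usr/bin/env python
# Sketch of a Hadoop rack-awareness (topology) script.
# Hadoop invokes it with one or more datanode IPs/hostnames as arguments
# and expects one rack path printed per argument, in order.
#
# ebs_nodes.txt is a hypothetical file listing the private IPs/hostnames
# of instances whose DataNode directories live on EBS volumes; any node
# not listed there is assumed to be on local instance storage.

import sys

EBS_NODES_FILE = "/etc/hadoop/ebs_nodes.txt"  # assumed location
EBS_RACK = "/rack-ebs"
LOCAL_RACK = "/rack-local"

def load_ebs_nodes(path):
    # Return the set of EBS-backed nodes; an unreadable file just means
    # everything falls back to the local-disk rack.
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except IOError:
        return set()

def main():
    ebs_nodes = load_ebs_nodes(EBS_NODES_FILE)
    for arg in sys.argv[1:]:
        if arg in ebs_nodes:
            print(EBS_RACK)
        else:
            print(LOCAL_RACK)

if __name__ == "__main__":
    main()

I would then run the balancer (hadoop balancer) from cron to keep the EBS rack from filling up ahead of the rest of the cluster.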