Hello all,

I am preparing an HDFS deployment on EC2 and have been thinking about a
backup strategy as it relates to EBS volumes vs. local (instance) disk storage.

EBS is nice for its internal replication and snapshot support, but it is not
free and it uses network bandwidth. Local disk costs nothing extra and does
not use network bandwidth, but its data is lost if the instance goes down.

I would like to minimize EBS usage for cost savings but also ensure that all
my data is backed up in EBS snapshots.

What if I had a network topology script that put all nodes using an EBS
volume in one 'rack' and all nodes using local disk in another? Since the
default block placement policy spreads each block's replicas across at least
two racks, every block would then be guaranteed to have at least one replica
persisted to EBS. I need to maintain at least 1/replicationFactor of my total
raw capacity in EBS, so if my replication factor is 3 and the capacity of
each node is equal, at least 1/3 of my nodes must be using EBS volumes.
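
For the topology script itself, I'm imagining something minimal along these
lines. This is just a sketch: the file path and rack names are made up, it
assumes the EBS-backed datanodes are listed one per line in a flat file, and
it would be wired up via topology.script.file.name in the config:

    #!/usr/bin/env python
    # Sketch of an HDFS topology script. Hadoop invokes it with one or
    # more datanode IPs/hostnames as arguments and expects one rack path
    # printed per argument.
    # Assumes /etc/hadoop/ebs-nodes.txt (made-up path) lists the
    # EBS-backed datanodes, one address per line.

    import sys

    EBS_NODES_FILE = '/etc/hadoop/ebs-nodes.txt'

    def load_ebs_nodes():
        try:
            with open(EBS_NODES_FILE) as f:
                return set(line.strip() for line in f if line.strip())
        except IOError:
            return set()

    def main():
        ebs_nodes = load_ebs_nodes()
        for addr in sys.argv[1:]:
            if addr in ebs_nodes:
                print('/rack-ebs')
            else:
                print('/rack-local')

    if __name__ == '__main__':
        main()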

Now I can back up and restore my cluster using EBS snapshots as usual.
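
For what it's worth, the snapshot step could be scripted with boto (the
Python EC2 library). A rough sketch, with placeholder volume IDs and region,
and no attempt to quiesce writes first:

    #!/usr/bin/env python
    # Snapshot the EBS volumes attached to the EBS-rack datanodes.
    # The volume IDs and region below are placeholders.

    import boto.ec2

    EBS_VOLUME_IDS = ['vol-11111111', 'vol-22222222']

    def snapshot_all():
        conn = boto.ec2.connect_to_region('us-east-1')
        for vol_id in EBS_VOLUME_IDS:
            snap = conn.create_snapshot(vol_id,
                                        description='hdfs backup of ' + vol_id)
            print('started %s for %s' % (snap.id, vol_id))

    if __name__ == '__main__':
        snapshot_all()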

It may happen that the EBS nodes, holding only 1/3 of the total cluster
capacity, will fill up faster than the local disk nodes. I think that
running the balancer regularly should ensure that enough space is maintained
on the EBS volumes for new blocks. Since each block keeps either one or two
of its three replicas on the EBS rack, the share of all stored data sitting
on the EBS nodes should always be somewhere between 1/3 and 2/3.
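
(In case it's useful: on the Hadoop versions I've looked at, the balancer can
be kicked off with bin/start-balancer.sh or 'hadoop balancer -threshold <pct>',
where the threshold is the allowed deviation from average utilization in
percent, so a cron job running it nightly seems like the simplest way to keep
things even.)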

Does this sound like a reasonable setup?

A couple questions:

1) How well does Hadoop handle racks of unequal capacity and different
performance characteristics?

2) Is the result of the network topology script cached by the namenode? How
expensive can the script be?

Thanks!
Grant
