Usable space ≈ (total configured space) / (replication factor). If you plug in your values and apply it for the whole FS: (500 GB x 5) / 3 = 2.5 TB / 3 ≈ 833 GB.
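If you want to sanity-check those numbers against a running cluster, the DFS admin report prints configured capacity and usage per DataNode and in total (a minimal sketch; note the reported figures are raw, i.e. before dividing by the replication factor):

  # Raw configured capacity, DFS used and DFS remaining, per DataNode and overall
  hadoop dfsadmin -report

  # Rough usable-space estimate at the FS-wide default:
  #   5 nodes x 500 GB = 2500 GB raw; 2500 GB / 3 replicas ~= 833 GB usable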
Note, however, that replication is a per-file property, so you can control it granularly instead of keeping it constant FS-wide if need be. Use the setrep utility: http://hadoop.apache.org/common/docs/current/file_system_shell.html#setrep (a quick command sketch follows at the end of this mail). For instance, you can keep non-critical files at 1 (no redundancy) or 2 replicas, and all important ones at 3. The calculation of usable space then becomes a more complex function.

Also, for 5 nodes, a replication factor of 2 may be fine too. It will let you tolerate one DN failure at a time, while 3 lets you tolerate two DN failures at the same time (unsure if you'll need that, since a power or switch loss in your case would mean the whole cluster going down anyway). You can raise the replication factor once you grow larger, and rebalance the cluster to get it properly functional again. With rep=2, you should have about 1.25 TB of usable space.

On Wed, Feb 1, 2012 at 9:06 AM, Michael Lok <fula...@gmail.com> wrote:
> Hi folks,
>
> We're planning to set up a 5 node Hadoop cluster. I'm thinking of just
> setting dfs.replication to 3, which is the default. Each data node will
> have 500 GB of local storage for DFS use.
>
> How do I calculate the amount of usable DFS space given the replication
> setting and the number of nodes in this case? Is there a formula which I
> can use?
>
> Any help is greatly appreciated.
>
> Thanks

--
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
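P.S. A minimal sketch of the per-file setrep usage mentioned above (the paths here are just placeholders for your own data layout):

  # Drop non-critical data to a single replica; -w waits until the change takes effect
  hadoop fs -setrep -w 1 /data/scratch/bulk-logs

  # Keep important data at 3 replicas, applied recursively with -R
  hadoop fs -setrep -R -w 3 /data/critical

  # The second column of a listing shows each file's current replication factor
  hadoop fs -ls /data/critical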