Hi Harsh,

Thanks for the info. If replication is set to 2, will there be any difference in performance when running MR jobs?
On Wed, Feb 1, 2012 at 1:02 PM, Harsh J <ha...@cloudera.com> wrote:
> (Total configured space / replication factor), which if you choose
> your values and apply it for the whole FS: ((500 GB x 5) / 3 rep
> factor) = (2.5 TB / 3 rep factor) = 833 GB.
>
> Note, however, that replication is a per-file property and you can
> control it granularly instead of keeping it constant FS-wide, if need
> be. Use the setrep utility:
> http://hadoop.apache.org/common/docs/current/file_system_shell.html#setrep.
> For instance, you can keep non-critical files at 1 (no extra copies) or 2
> replicas, and all important ones at 3. The calculation of usable
> space then becomes a more complex function.
>
> Also, for 5 nodes, a replication factor of two may be okay too.
> This will let you tolerate one DN failure at a time, while 3 will let you
> tolerate two DN failures at the same time (unsure if you'll need that,
> since a power or switch loss in your case would mean the whole cluster
> going down anyway). You can raise the replication factor once you grow
> larger, and rebalance the cluster to get it properly functional again.
> With rep=2, you should have 1.25 TB of usable space.
>
> On Wed, Feb 1, 2012 at 9:06 AM, Michael Lok <fula...@gmail.com> wrote:
>> Hi folks,
>>
>> We're planning to set up a 5-node Hadoop cluster. I'm thinking of just
>> setting dfs.replication to 3, which is the default. Each data node will
>> have 500 GB of local storage for DFS use.
>>
>> How do I calculate the amount of usable DFS space given the replication
>> setting and the number of nodes in this case? Is there a formula I
>> can use?
>>
>> Any help is greatly appreciated.
>>
>> Thanks
>
>
> --
> Harsh J
> Customer Ops. Engineer
> Cloudera | http://tiny.cloudera.com/about
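
The usable-space arithmetic quoted above can be sketched as a small helper (a minimal illustration; the function name and values are mine, not part of any Hadoop API, and it assumes a uniform replication factor across the whole FS):

```python
def usable_dfs_space_gb(nodes, storage_per_node_gb, replication):
    """Usable HDFS space = total raw capacity / replication factor."""
    return nodes * storage_per_node_gb / replication

# 5 nodes x 500 GB each, default replication of 3 -> ~833 GB usable
print(round(usable_dfs_space_gb(5, 500, 3)))  # 833

# Same cluster with replication 2 -> 1250 GB, i.e. 1.25 TB usable
print(round(usable_dfs_space_gb(5, 500, 2)))  # 1250
```

As Harsh notes, once you set per-file replication with setrep the calculation is no longer a single division; you would have to sum (file size x its replication factor) across files instead.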