Hello again,

On Fri, Nov 5, 2010 at 12:52 AM, Hari Sreekumar <hsreeku...@clickable.com> wrote:
> Hi Harsh,
>
> Thanks for the reply. So if I have a 2048 MB file with a 64 MB
> block size (32 blocks) and replication 3, then I'll have 96 blocks of the
> file on HDFS, with no two replicas of the same block being on the same
> datanode. Also, if I change the dfs.replication property, does it affect
> files already in HDFS, or is it valid only for new files that will be
> uploaded into HDFS? Is there a way to rebalance the cluster based on the
> new replication factor?
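Yes, that arithmetic works out: 2048 MB / 64 MB = 32 blocks, and with a
replication factor of 3 that is 32 x 3 = 96 block replicas in the cluster.
The default block placement policy will not put two replicas of the same
block on the same DataNode.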
Replication is per-file. Changing the property will not affect existing
files, even if you restart the DataNodes with a different replication
factor value; it only applies to files written after the change. There is
a "setrep" shell command you can use to change the replication factor of
existing files. See:
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep

> And if I have replication set to 3, do all the 3 disk writes happen
> simultaneously or is there some background process which does the
> replication? If not, then increasing replication would lead to more writes
> and thus reduce performance of any write-intensive job, am I right?

They do not happen simultaneously. The NameNode computes the work of
replicating blocks to DataNodes at a regular interval, set by
"dfs.namenode.replication.interval" (3 seconds by default). I believe the
load on the nodes is also considered when assigning replication work
between DataNodes.

I haven't seen increasing or decreasing the replication factor affect the
write performance of MapReduce jobs, but yes, I suppose it could lower the
network transfer rates available to the jobs while a node is being
decommissioned and/or the cluster is being re-balanced.

-- 
Harsh J
www.harshj.com
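P.S. For example (the path below is just a placeholder), something like
the following would change an existing file's replication factor to 2 and
wait until the change has completed; the second column of the -ls output
then shows the file's current replication factor:

  hadoop fs -setrep -w 2 /user/hari/data.txt
  hadoop fs -ls /user/hari/data.txt

Adding the -R flag does the same recursively for every file under a
directory.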