It could be that your decommission has hung because a particular block (file) requires a very large replication factor, and your remaining number of nodes is less than that value. This is a genuine reason for a hang (but one that must be fixed). The decommission process usually waits until there are no under-replicated blocks, so I'd run fsck to check whether any such blocks are present, and setrep them to a lower value.
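For example, something along these lines (the path and the target replication factor below are placeholders; use whatever fsck actually flags):

    # Report filesystem health, including any under-replicated blocks
    hadoop fsck / -files -blocks

    # Lower the replication factor of an affected file; -w waits
    # until the new factor is actually reached
    hadoop fs -setrep -w 3 /path/to/under-replicated/file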
On Fri, Aug 12, 2011 at 9:28 PM, <[email protected]> wrote:
> Hi All,
>
> I'm trying to decommission a data node from my cluster. I put the data
> node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the
> name node. The under-replicated blocks are starting to replicate, but the
> count is going down at a very slow pace. For 1 TB of data it takes over
> 1 day to complete. We changed the settings as below to try to increase
> the replication rate.
>
> We added this to hdfs-site.xml on all the nodes in the cluster and
> restarted the data node and name node processes.
>
> <property>
>   <!-- 100Mbit/s -->
>   <name>dfs.balance.bandwidthPerSec</name>
>   <value>131072000</value>
> </property>
>
> The speed didn't seem to pick up. Do you know what may be happening?
>
> Thanks!
> Jonathan

--
Harsh J
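A minimal sketch of the decommission workflow described above, assuming dfs.hosts.exclude is already referenced from hdfs-site.xml (a refresh is usually enough, so the name node restart mentioned above isn't required):

    # Make the name node re-read dfs.hosts / dfs.hosts.exclude
    hadoop dfsadmin -refreshNodes

    # Check progress; a decommissioning node is reported with
    # "Decommission Status : Decommission in progress"
    hadoop dfsadmin -report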
