Unless the last copy of a block is on that node. Decommissioning is the only safe way to shut off 10 nodes at once. Doing them one at a time and waiting for replication to (asymptotically) recover is painful and error-prone.
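
To put a rough number on the "10 nodes at once" risk, here is a minimal back-of-the-envelope sketch. It assumes each block has 3 replicas (the HDFS default) placed uniformly at random on distinct nodes of a 50-node cluster, which ignores rack-aware placement, so treat the figures as illustrative only. The function names and the block count are hypothetical, made up for this example.

```python
from math import comb


def prob_block_lost(num_nodes: int, removed: int, replication: int = 3) -> float:
    """Chance that every replica of one block sits on the removed nodes.

    Assumption: replicas are placed uniformly at random on distinct nodes
    (no rack awareness), so this is a plain hypergeometric count.
    """
    if removed < replication:
        # Fewer nodes removed than replicas: at least one copy survives.
        return 0.0
    return comb(removed, replication) / comb(num_nodes, replication)


def expected_lost_blocks(num_blocks: int, num_nodes: int, removed: int,
                         replication: int = 3) -> float:
    """Expected number of blocks whose replicas all vanish at the same time."""
    return num_blocks * prob_block_lost(num_nodes, removed, replication)


if __name__ == "__main__":
    # Hypothetical block count, chosen only for illustration.
    num_blocks = 500_000
    for removed in (1, 2, 3, 10):
        lost = expected_lost_blocks(num_blocks, 50, removed)
        print(f"power off {removed:2d} nodes at once: ~{lost:.1f} blocks lost")
```

Powering off one or two nodes can never take out all three replicas of a block, which is why the one-at-a-time approach works if you wait for re-replication between nodes. Once three or more go down together the expected loss is nonzero, and decommissioning avoids that by copying blocks elsewhere before the nodes leave.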

On Fri, Mar 18, 2011 at 9:08 AM, James Seigel wrote:

> Just a note. If you just shut the node off, the blocks will replicate
> faster.
>
> James.
>
>
> On 2011-03-18, at 10:03 AM, Ted Dunning wrote:
>
> > If nobody else more qualified is willing to jump in, I can at least
> > provide some pointers.
> >
> > What you describe is a bit surprising. I have zero experience with any
> > 0.21 version, but decommissioning was working well in much older
> > versions, so this would be a surprising regression.
> >
> > The observations you have aren't all inconsistent with how
> > decommissioning should work. The fact that your nodes look up after
> > starting the decommissioning isn't so strange. The idea is that no
> > new data will be put on the node, nor should it be counted as a
> > replica, but it will help in reading data.
> >
> > So that isn't such a big worry.
> >
> > The fact that it takes forever and a day, however, is a big worry. I
> > cannot provide any help there just offhand.
> >
> > What happens when a datanode goes down? Do you see under-replicated
> > files? Does the number of such files decrease over time?
> >
> > On Fri, Mar 18, 2011 at 4:23 AM, Rita wrote:
> >
> >> Any help?
> >>
> >>
> >> On Wed, Mar 16, 2011 at 9:36 PM, Rita wrote:
> >>
> >>> Hello,
> >>>
> >>> I have been struggling with decommissioning data nodes. I have a 50+
> >>> data node cluster (no MR) with each server holding about 2TB of
> >>> storage. I split the nodes into 2 racks.
> >>>
> >>> I edit the 'exclude' file and then do a -refreshNodes. I see the node
> >>> immediately in 'Decommissioned node' and I also see it as a 'live'
> >>> node! Even though I wait 24+ hours, it's still like this. I suspect
> >>> it's a bug in my version. The data node process is still running on
> >>> the node I am trying to decommission. So, sometimes I kill -9 the
> >>> process and I see the 'under replicated' blocks...this can't be the
> >>> normal procedure.
> >>>
> >>> There were even times that I had corrupt blocks because I was
> >>> impatient -- waited 24-34 hours
> >>>
> >>> I am using 23 August, 2010: release 0.21.0
> >>> <http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available>
> >>> version.
> >>>
> >>> Is this a known bug? Is there anything else I need to do to
> >>> decommission a node?
> >>>
> >>> --
> >>> --- Get your facts first, then you can distort them as you please.--
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
