Re: decommissioning node woes

Ted Dunning Fri, 18 Mar 2011 09:39:22 -0700

Unless the last copy is on that node.

Decommissioning is the only safe way to shut off 10 nodes at once.  Doing
them one at a time and waiting for replication to (asymptotically) recover
is painful and error-prone.
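
To make the "last copy" point concrete, here is a rough sketch (not
anything from HDFS itself) that checks which blocks would lose every
replica if a set of nodes were switched off together.  The block names,
node names, placement map and the replication target of 3 are all made up
for illustration.

```python
# Toy model only: the placement map and node names below are hypothetical,
# and the target of 3 replicas is an assumption, not read from any cluster.

def blocks_lost(placement, nodes_down):
    """Return the blocks whose every replica sits on a node in nodes_down."""
    down = set(nodes_down)
    return sorted(b for b, replicas in placement.items()
                  if set(replicas) <= down)


def under_replicated(placement, nodes_down, target=3):
    """Map each surviving block with fewer than `target` live replicas
    to its live replica count."""
    down = set(nodes_down)
    result = {}
    for block, replicas in placement.items():
        live = set(replicas) - down
        if 0 < len(live) < target:
            result[block] = len(live)
    return result


# Hypothetical placement: block id -> datanodes holding a replica.
placement = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn4", "dn5", "dn6"],
}

if __name__ == "__main__":
    down = ["dn1", "dn2", "dn3"]
    print("lost:", blocks_lost(placement, down))
    print("under-replicated:", under_replicated(placement, down))
```

Switching off one node in this toy layout loses nothing, but switching off
dn1, dn2 and dn3 together takes out every copy of blk_1.  Decommissioning
avoids that because the nodes keep serving reads until their blocks have
been copied elsewhere.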


On Fri, Mar 18, 2011 at 9:08 AM, James Seigel <[email protected]> wrote:

> Just a note.  If you just shut the node off, the blocks will replicate
> faster.
>
> James.
>
>
> On 2011-03-18, at 10:03 AM, Ted Dunning wrote:
>
> > If nobody else more qualified is willing to jump in, I can at least
> > provide some pointers.
> >
> > What you describe is a bit surprising.  I have zero experience with any
> > 0.21 version, but decommissioning was working well in much older
> > versions, so this would be a surprising regression.
> >
> > The observations you have aren't all inconsistent with how
> > decommissioning should work.  The fact that your nodes still show as
> > live after starting the decommissioning isn't so strange.  The idea is
> > that no new data will be put on the node, nor should it be counted as a
> > replica, but it will still help in reading data.
> >
> > So that isn't such a big worry.
> >
> > The fact that it takes forever and a day, however, is a big worry.  I
> > cannot provide any help there just offhand.
> >
> > What happens when a datanode goes down?  Do you see under-replicated
> > files?  Does the number of such files decrease over time?
> >
> > On Fri, Mar 18, 2011 at 4:23 AM, Rita <[email protected]> wrote:
> >
> >> Any help?
> >>
> >>
> >> On Wed, Mar 16, 2011 at 9:36 PM, Rita <[email protected]> wrote:
> >>
> >>> Hello,
> >>>
> >>> I have been struggling with decommissioning data nodes. I have a 50+
> >>> data node cluster (no MR) with each server holding about 2TB of
> >>> storage. I split the nodes into 2 racks.
> >>>
> >>>
> >>> I edit the 'exclude' file and then do a -refreshNodes. I see the node
> >>> immediately in 'Decommissioned node' and I also see it as a 'live' node!
> >>> Even though I wait 24+ hours, it's still like this. I suspect it's a
> >>> bug in my version.  The data node process is still running on the node
> >>> I am trying to decommission. So, sometimes I kill -9 the process and I
> >>> see the 'under replicated' blocks... this can't be the normal procedure.
> >>>
> >>> There were even times that I had corrupt blocks because I was impatient
> >>> after waiting 24-34 hours.
> >>>
> >>> I am using the 23 August, 2010 release, version 0.21.0:
> >>> http://hadoop.apache.org/hdfs/releases.html#23+August%2C+2010%3A+release+0.21.0+available
> >>>
> >>> Is this a known bug? Is there anything else I need to do to
> >>> decommission a node?
> >>>
> >>> --
> >>> --- Get your facts first, then you can distort them as you please.--
> >>>
> >>
> >>
> >>
> >> --
> >> --- Get your facts first, then you can distort them as you please.--
> >>
>
>
