Re: What is Dead Region Servers and how to clear them up?

jeff saremi Sat, 27 May 2017 11:59:32 -0700

Yes. We don't have fixed servers, with the exception of the ZK machines.

We have 3 Yarn jobs, one each for the master, region, and thrift servers, each
launched separately with a different number of nodes. I hope that's not what is
causing problems.


________________________________
From: Ted Yu <[email protected]>
Sent: Saturday, May 27, 2017 11:27:36 AM
To: [email protected]
Cc: Hbase-User; Yu Li
Subject: Re: What is Dead Region Servers and how to clear them up?

Jeff:
bq. We run our cluster on Yarn and upon restarting jobs in Yarn

Can you clarify a bit more: are you running HBase processes inside Yarn
containers?

Cheers

On Sat, May 27, 2017 at 10:58 AM, jeff saremi <[email protected]>
wrote:

> Thanks @Yu Li
>
> You are absolutely correct. Dead RS's will happen regardless. My issue
> with this is more "psychological". If I have done everything needed to be
> done to ensure that RSs are running fine and regions are assigned and such
> and hbck reports are consistent then how is this list of dead region
> servers helping me? other than causing anxiety?
> We run our cluster on Yarn and upon restarting jobs in Yarn we get a lot
> of inconsistent, unavailable regions. (and this is only one scenario). Then
> we'll run hbck with -repair option (and i was wrong here too: hbck does
> take care of some issues) and restart the master(s). After that there seem
> to be no more issues other than dead region servers being still reported.
> We should not have this anymore after having taken all precautions to reset
> the system properly.
>
> I was trying to write something similar to what hbck would do to take
> care of this specific issue. I wouldn't mind contributing to the hbck
> itself either. However I needed to understand where this list comes from
> and why. These are things that I could possibly automate (after all the
> other steps i mentioned):
> - check the ZK list of RS's. If any of the dead RS's are found, remove the node
>
> - check hdfs root WALs folder. If there are any with the dead RS's name in
> them, delete them. (here we need to take precaution as @Enis mentioned;
> possibly if the node timestamp has not been changed in a while)
>
> - what else? These steps are not enough
>
> For instance, we currently have 17 servers being reported as dead. Only
> 3-4 of them show up in hdfs with "-splitting" in their WALS folder. Where
> do the rest come from?
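
A minimal Python sketch of the cross-check described in the steps above. It
only classifies the dead servers and deletes nothing. The function name, the
report keys, and the sample server names are hypothetical. The three input
lists are assumed to have been collected beforehand: the dead servers shown
in the master UI, the children of the RS znode in ZK, and the directory names
under <hbase_hdfs>/WALs.

```python
def classify_dead_servers(dead_servers, zk_servers, wal_dirs):
    """Report, per dead server, whether it still has a ZK node or a WAL directory.

    All three arguments are lists of plain strings gathered by the caller.
    Server names are assumed to use the "host,port,startcode" form.
    """
    zk = set(zk_servers)
    report = {}
    for server in dead_servers:
        # Match the exact directory name or its "-splitting" variant, so that
        # "host,16020,1" is never confused with "host,16020,12".
        matching_dirs = [
            d for d in wal_dirs
            if d == server or d == server + "-splitting"
        ]
        report[server] = {"in_zk": server in zk, "wal_dirs": matching_dirs}
    return report


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    dead = ["rs1.example.com,16020,1493846660401",
            "rs2.example.com,16020,1493846660402"]
    zk = ["rs3.example.com,16020,1493846660403"]
    wals = ["rs1.example.com,16020,1493846660401-splitting",
            "rs3.example.com,16020,1493846660403"]
    for name, info in classify_dead_servers(dead, zk, wals).items():
        print(name, info)
```

Servers that come back with no ZK node and no WAL directory are the ones
whose origin is still unexplained by these two checks.
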
> thanks
>
> Jeff
>
> ________________________________
> From: Yu Li <[email protected]>
> Sent: Friday, May 26, 2017 10:18:09 PM
> To: Hbase-User
> Cc: [email protected]
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> bq. And having a list of "dead" servers is not a healthy thing to have.
> I don't think the existence of "dead" servers means the service is
> unhealthy, especially in a distributed system. Besides hbase, HDFS also
> shows Live and Dead nodes in namenode UI, and people won't regard HDFS as
> unhealthy if there're dead nodes.
>
> In HBase, if some RS aborts due to an unexpected issue like a long GC, normally
> we will restart it, and once it's restarted and reports to the master, it will be
> removed from the dead server list. So when we observe a dead server in the
> Master UI, the first thing to do is check the root cause and restart it if
> that won't cause further issues.
>
> However, sometimes we may find the server aborted due to some hardware
> failure and we must take the server offline for repair. Or we need to move
> some nodes to join other clusters so we stop the RS process on purpose. I
> guess this is the case you're dealing with @jeff? If so, I think it's a
> reasonable requirement that we supply a command in hbase to clear the dead
> nodes when the operator is sure they are no longer serving.
>
> Best Regards,
> Yu
>
> On 27 May 2017 at 04:49, Enis Söztutar <[email protected]> wrote:
>
> > In general if there are no regions in transition, the WAL recovery has
> > already finished. You can watch the master's log4j log for those entries,
> > but the lack of regions in transition is the easiest way to identify.
> >
> > Enis
> >
> > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <[email protected]>
> > wrote:
> >
> > > thanks Enis
> > >
> > > I apologize for earlier
> > >
> > > This looks very close to our issue
> > > When you say: "there is no "WAL" recovery is happening", how could I make
> > > sure of that? Thanks
> > >
> > > Jeff
> > >
> > >
> > > ________________________________
> > > From: Enis Söztutar <[email protected]>
> > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > To: [email protected]
> > > Cc: hbase-user
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > > Jeff, please be respectful to the people who are trying to help you. This is
> > > > not acceptable behavior and will result in consequences next time.
> > >
> > > > On the specific issue that you are seeing, it is highly likely that you are
> > > > seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having
> > > > those servers in the dead servers list will not hurt operations, or
> > > > runtimes or anything else. Possibly for those servers, there is no new
> > > > instance of the regionserver running on the same host and ports.
> > >
> > > > If you want to manually clean these out, you can follow these steps:
> > > >  - Manually move these directories from the file system:
> > > > <hbase_hdfs>/WALs/dead-server-splitting
> > > >  - ONLY do this if you are sure that no "WAL" recovery is
> > > > happening, and there are only WAL files with names containing ".meta."
> > > >  - Restart HBase master.
> > >
> > > Upon restart, you can see that these do not show up anymore. For more
> > > technical details, please refer to the jira link.
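
A minimal Python sketch of the file-name precondition in the steps above. It
assumes the directory name and its file listing were already fetched from
HDFS by the caller. The function name is hypothetical. It cannot tell whether
WAL recovery is still in progress, so that has to be confirmed separately
before moving anything. Treating an empty directory as not safe is an
assumption made here to stay conservative.

```python
def looks_like_leftover_splitting_dir(dir_name, file_names):
    """Return True if a WAL directory matches the manual-cleanup precondition.

    The directory must end in "-splitting" and every file in it must have
    ".meta." in its name. An empty directory is reported as False so that a
    person looks at it first (a conservative assumption, not an HBase rule).
    """
    if not dir_name.endswith("-splitting"):
        return False
    if not file_names:
        return False
    return all(".meta." in name for name in file_names)
```

A directory holding a single file such as
1493846660401..meta.1493922323600.meta passes this check, while a directory
that also holds a non-meta WAL file does not.
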
> > >
> > > Enis
> > >
> > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <[email protected]>
> > > wrote:
> > >
> > > > Thank you for the GFY answer
> > > >
> > > > > And I guess to figure out how to fix these I can always go through the
> > > > > HBase source code.
> > > >
> > > >
> > > > ________________________________
> > > > From: Dima Spivak <[email protected]>
> > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > To: hbase-user
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Sending this back to the user mailing list.
> > > >
> > > > > RegionServers can die for many reasons. Looking at your RegionServer log
> > > > > files should give hints as to why it's happening.
> > > >
> > > >
> > > > -Dima
> > > >
> > > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <[email protected]>
> > > > > wrote:
> > > >
> > > > > > I had posted this to the user mailing list and I have not got any direct
> > > > > > answer to my question.
> > > > > >
> > > > > > Where do dead RS's come from and how can they be cleaned up? Someone in
> > > > > > the midst of developers should know this.
> > > > >
> > > > > thanks
> > > > >
> > > > > Jeff
> > > > >
> > > > > ________________________________
> > > > > From: jeff saremi <[email protected]>
> > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > To: [email protected]
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > > I'm still looking to get hints on how to remove the dead regions. thanks
> > > > >
> > > > > ________________________________
> > > > > From: jeff saremi <[email protected]>
> > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > To: [email protected]
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > i'm trying to eliminate the dead region servers.
> > > > >
> > > > > ________________________________
> > > > > From: Ted Yu <[email protected]>
> > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > To: [email protected]
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > bq. running hbck (many times
> > > > >
> > > > > > Can you describe the specific inconsistencies you were trying to resolve?
> > > > > > Depending on the inconsistencies, advice can be given on the best known
> > > > > > hbck command arguments to use.
> > > > >
> > > > > Feel free to pastebin master log if needed.
> > > > >
> > > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > these are the things I have done so far:
> > > > > >
> > > > > >
> > > > > > - restarting master (few times)
> > > > > >
> > > > > > > - running hbck (many times; this tool does not seem to be doing anything
> > > > > > > at all)
> > > > > > >
> > > > > > > - checking the list of region servers in ZK (none of the dead ones are
> > > > > > > listed here)
> > > > > > >
> > > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones only 3
> > > > > > > are listed here with "-splitting" at the end of their names and they
> > > > > > > contain one single file like: 1493846660401..meta.1493922323600.meta
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <[email protected]>
> > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > To: [email protected]
> > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > > Apparently having dead region servers is so common that a section of the
> > > > > > > master console is dedicated to that?
> > > > > > > How can we clean this up (preferably in an automated fashion)? Why isn't
> > > > > > > this being done by HBase automatically?
> > > > > >
> > > > > >
> > > > > > thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>
