Yes. We don't have fixed servers, with the exception of the ZK machines. We have 3 Yarn jobs, one each for the master, region, and thrift servers, each launched separately with a different number of nodes. I hope that's not what is causing the problems.
________________________________
From: Ted Yu <yuzhih...@gmail.com>
Sent: Saturday, May 27, 2017 11:27:36 AM
To: dev@hbase.apache.org
Cc: Hbase-User; Yu Li
Subject: Re: What is Dead Region Servers and how to clear them up?

Jeff:
bq. We run our cluster on Yarn and upon restarting jobs in Yarn

Can you clarify a bit more: are you running the HBase processes inside Yarn
containers?

Cheers

On Sat, May 27, 2017 at 10:58 AM, jeff saremi <jeffsar...@hotmail.com> wrote:

> Thanks @Yu Li<mailto:car...@gmail.com>
>
> You are absolutely correct. Dead RS's will happen regardless. My issue
> with this is more "psychological". If I have done everything needed to
> ensure that the RSs are running fine, regions are assigned, and hbck
> reports are consistent, then how is this list of dead region servers
> helping me, other than causing anxiety?
> We run our cluster on Yarn, and upon restarting jobs in Yarn we get a lot
> of inconsistent, unavailable regions (and this is only one scenario). Then
> we run hbck with the -repair option (and I was wrong here too: hbck does
> take care of some issues) and restart the master(s). After that there seem
> to be no more issues other than dead region servers still being reported.
> We should not have this anymore after having taken all precautions to
> reset the system properly.
>
> I was trying to write something similar to what hbck would do to take
> care of this specific issue. I wouldn't mind contributing to hbck itself
> either. However, I needed to understand where this list comes from and
> why. These are things that I could possibly automate (after all the other
> steps I mentioned):
>
> - check the ZK list of RS's. If any of the dead RS's are found, remove
>   the node
>
> - check the hdfs root WALs folder. If there are any with a dead RS's name
>   in them, delete them. (here we need to take precautions as @Enis
>   mentioned; possibly only if the node timestamp has not changed in a
>   while)
>
> - what else? These steps are not enough
>
> For instance, we currently have 17 servers being reported as dead. Only
> 3-4 of them show up in hdfs with "-splitting" in their WALs folder. Where
> do the rest come from?
>
> thanks
>
> Jeff
>
> ________________________________
> From: Yu Li <car...@gmail.com>
> Sent: Friday, May 26, 2017 10:18:09 PM
> To: Hbase-User
> Cc: dev@hbase.apache.org
> Subject: Re: What is Dead Region Servers and how to clear them up?
>
> bq. And having a list of "dead" servers is not a healthy thing to have.
>
> I don't think the existence of "dead" servers means the service is
> unhealthy, especially in a distributed system. Besides HBase, HDFS also
> shows Live and Dead nodes in the namenode UI, and people won't regard HDFS
> as unhealthy if there are dead nodes.
>
> In HBase, if some RS aborts due to an unexpected issue like a long GC,
> normally we will restart it, and once it has restarted and reported to the
> master, it will be removed from the dead server list. So when we observe a
> dead server in the Master UI, the first thing is to check the root cause
> and restart it if that won't cause further issues.
>
> However, sometimes we may find the server aborted due to some hardware
> failure and we must take the server offline for repair. Or we need to move
> some nodes to join other clusters, so we stop the RS process on purpose. I
> guess this is the case you're dealing with @jeff? If so, I think it's a
> reasonable requirement that we supply a command in HBase to clear the dead
> nodes when the operator is sure they no longer serve.
>
> Best Regards,
> Yu
>
> On 27 May 2017 at 04:49, Enis Söztutar <enis....@gmail.com> wrote:
>
> > In general, if there are no regions in transition, the WAL recovery has
> > already finished. You can watch the master's log4j log for those
> > entries, but the lack of regions in transition is the easiest way to
> > identify this.
> >
> > Enis
> >
> > On Fri, May 26, 2017 at 12:14 PM, jeff saremi <jeffsar...@hotmail.com>
> > wrote:
> >
> > > thanks Enis
> > >
> > > I apologize for earlier
> > >
> > > This looks very close to our issue
> > > When you say: "no "WAL" recovery is happening", how could I make
> > > sure of that? Thanks
> > >
> > > Jeff
> > >
> > > ________________________________
> > > From: Enis Söztutar <enis....@gmail.com>
> > > Sent: Friday, May 26, 2017 11:47:11 AM
> > > To: dev@hbase.apache.org
> > > Cc: hbase-user
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > Jeff, please be respectful to the people who are trying to help you.
> > > This is not acceptable behavior and will result in consequences next
> > > time.
> > >
> > > On the specific issue that you are seeing, it is highly likely that
> > > you are seeing this:
> > > https://issues.apache.org/jira/browse/HBASE-14223. Having those
> > > servers in the dead servers list will not hurt operations, or
> > > runtimes, or anything else. Possibly for those servers, there is no
> > > new instance of the regionserver running on the same host and ports.
> > >
> > > If you want to manually clean these out, you can follow these steps:
> > > - Manually move these directories from the file system:
> > >   <hbase_hdfs>/WALs/dead-server-splitting
> > > - ONLY do this if you are sure that no "WAL" recovery is happening,
> > >   and there are only WAL files with names containing ".meta."
> > > - Restart HBase master.
> > >
> > > Upon restart, you can see that these do not show up anymore. For more
> > > technical details, please refer to the jira link.
> > >
> > > Enis
> > >
> > > On Fri, May 26, 2017 at 11:03 AM, jeff saremi
> > > <jeffsar...@hotmail.com> wrote:
> > >
> > > > Thank you for the GFY answer
> > > >
> > > > And I guess to figure out how to fix these I can always go through
> > > > the HBase source code.
> > > >
> > > > ________________________________
> > > > From: Dima Spivak <dimaspi...@apache.org>
> > > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > > To: hbase-user
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > Sending this back to the user mailing list.
> > > >
> > > > RegionServers can die for many reasons. Looking at your RegionServer
> > > > log files should give hints as to why it's happening.
> > > >
> > > > -Dima
> > > >
> > > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi
> > > > <jeffsar...@hotmail.com> wrote:
> > > >
> > > > > I had posted this to the user mailing list and I have not got any
> > > > > direct answer to my question.
> > > > >
> > > > > Where do dead RS's come from and how can they be cleaned up?
> > > > > Someone in the midst of developers should know this.
> > > > >
> > > > > thanks
> > > > >
> > > > > Jeff
> > > > >
> > > > > ________________________________
> > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > > To: u...@hbase.apache.org
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > I'm still looking to get hints on how to remove the dead regions.
> > > > > thanks
> > > > >
> > > > > ________________________________
> > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > > To: u...@hbase.apache.org
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > I'm trying to eliminate the dead region servers.
> > > > >
> > > > > ________________________________
> > > > > From: Ted Yu <yuzhih...@gmail.com>
> > > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > > To: u...@hbase.apache.org
> > > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > bq. running hbck (many times
> > > > >
> > > > > Can you describe the specific inconsistencies you were trying to
> > > > > resolve? Depending on the inconsistencies, advice can be given on
> > > > > the best known hbck command arguments to use.
> > > > >
> > > > > Feel free to pastebin the master log if needed.
> > > > >
> > > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi
> > > > > <jeffsar...@hotmail.com> wrote:
> > > > >
> > > > > > these are the things I have done so far:
> > > > > >
> > > > > > - restarting master (a few times)
> > > > > >
> > > > > > - running hbck (many times; this tool does not seem to be doing
> > > > > >   anything at all)
> > > > > >
> > > > > > - checking the list of region servers in ZK (none of the dead
> > > > > >   ones are listed here)
> > > > > >
> > > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones
> > > > > >   only 3 are listed here with "-splitting" at the end of their
> > > > > >   names, and they contain one single file like:
> > > > > >   1493846660401..meta. 1493922323600.meta
> > > > > >
> > > > > > ________________________________
> > > > > > From: jeff saremi <jeffsar...@hotmail.com>
> > > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > > To: u...@hbase.apache.org
> > > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > > >
> > > > > > Apparently having dead region servers is so common that a
> > > > > > section of the master console is dedicated to that?
> > > > > > How can we clean this up (preferably in an automated fashion)?
> > > > > > Why isn't this being done by HBase automatically?
> > > > > >
> > > > > > thanks
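
For anyone who wants to script the check Enis describes in this thread (only touch a "-splitting" directory when every file in it has ".meta." in its name), here is a rough sketch of just the selection logic. It is not an HBase API and it does not talk to HDFS: the function name, the dictionary layout, and the example host and file names are all made up for illustration, and it assumes you have already listed <hbase_hdfs>/WALs yourself. It treats an empty "-splitting" directory as not safe to select, which is a conservative guess on my part rather than anything stated in the thread. You would still need to confirm that there are no regions in transition before moving anything.

```python
from typing import Dict, List

SPLITTING_SUFFIX = "-splitting"  # suffix seen on dead-server WAL dirs in the thread
META_MARKER = ".meta."           # marker Enis says the remaining WAL files must contain


def removable_splitting_dirs(wal_dirs: Dict[str, List[str]]) -> List[str]:
    """Return '-splitting' directories whose files all contain '.meta.'.

    wal_dirs maps a directory name under <hbase_hdfs>/WALs to the file
    names inside it (hypothetical layout; build it from your own listing).
    """
    candidates = []
    for dir_name in sorted(wal_dirs):
        if not dir_name.endswith(SPLITTING_SUFFIX):
            continue  # not a leftover from a dead server, leave it alone
        files = wal_dirs[dir_name]
        # Skip empty directories and anything holding a non-meta WAL file.
        if files and all(META_MARKER in name for name in files):
            candidates.append(dir_name)
    return candidates


# Hypothetical listing; host names and file names are invented.
example_listing = {
    "rs1.example.com,16020,1493846660401-splitting": [
        "rs1.example.com%2C16020%2C1493846660401..meta.1493922323600.meta",
    ],
    "rs2.example.com,16020,1493846660402-splitting": [
        "rs2.example.com%2C16020%2C1493846660402.default.1493922323601",
    ],
    "rs3.example.com,16020,1493846660403": [
        "rs3.example.com%2C16020%2C1493846660403..meta.1493922323602.meta",
    ],
    "rs4.example.com,16020,1493846660404-splitting": [],
}

if __name__ == "__main__":
    for name in removable_splitting_dirs(example_listing):
        print(name)
```

The sketch only reports candidates. Moving the directories and restarting the master are left as manual steps, as in Enis's instructions.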