[ 
https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748257#comment-16748257
 ] 

Xu Cang commented on HBASE-21576:
---------------------------------

just providing my limited knowledge about this here, please correct me if I 
miss something.

When a RS is dead (its zk ephemeral node deleted), it will be added to dead 
servers' list. Then ServerCrashProcedure will kick in to do housekeeping work 
such as moving regions to other servers, which will handle reassigning meta.

So the question here is, why was your RS not added into dead servers list.


> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
>                 Key: HBASE-21576
>                 URL: https://issues.apache.org/jira/browse/HBASE-21576
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Master has killed an RS that was hosting meta due to some HDFS issue (most 
> likely; I've lost the RS logs due to HBASE-21575).
> RS took a very long time to die (again, might be a separate bug, I'll file if 
> I see repro), and a long time to restart; meanwhile master never tried to 
> reassign meta, and eventually killed itself not being able to update it.
> It seems like a RS on a bad machine would be especially prone to slow 
> abort/startup, as well as to issues causing master to kill it, so it would 
> make sense for master to immediately relocate meta once meta-hosting RS is 
> dead after a kill; or even when killing the RS. In the former case (if the RS 
> needs to die for meta to be reassigned safely), perhaps the RS hosting meta 
> in particular should try to die fast in such circumstances, and not do any 
> cleanup.
> {noformat}
> 2018-12-08 04:52:55,144 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] 
> master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal 
> error:
> ***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL 
> required. Forcing server shutdown *****
> .... [aborting for ~7 minutes]
> 2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call 
> exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server 
> <server1>,17020,1544264858183 aborting, details=row '...' on table 
> 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=<server1>,17020,1544264858183, seqNum=-1
> ... [starting for ~5]
> 2018-12-08 04:59:58,574 INFO  
> [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] 
> client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, 
> started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on 
> connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: 
> connection timed out: <server1>, details=row '...' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, 
> seqNum=-1
> ... [re-initializing for at least ~7]
> 2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] 
> client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, 
> started=41137 ms ago, cancelled=false, 
> msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
> <server1>,17020,1544274145387 is not running yet
> ...
> 2018-12-08 05:11:18,470 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: 
> ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... 
> state=OPEN *****^M
> {noformat}
> There are no signs of meta assignment activity at all in master logs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to